Abstract. We review recent results for high-dimensional sparse linear
regression in the practical case of unknown variance. Different sparsity
settings are covered, including coordinate-sparsity, group-sparsity and
variation-sparsity. The emphasis is put on non-asymptotic analyses and
feasible procedures. In addition, a small numerical study compares the
practical performance of three schemes for tuning the Lasso estimator,
and some references are collected for more general models,
including multivariate regression and nonparametric regression.

1. INTRODUCTION

In the present paper, we mainly focus on the linear regression model

$Y = X\beta_0 + \varepsilon$, (1)

where $Y$ is an n-dimensional response vector, X is a fixed $n \times p$ design matrix,
and the vector $\varepsilon$ is made of n i.i.d. Gaussian random variables with $\mathcal{N}(0, \sigma^2)$
distribution. In the sequel, $X^{(i)}$ stands for the i-th row of X. Our interest is in
the high-dimensional setting, where the dimension p of the unknown parameter
$\beta_0$ is large, possibly larger than n.

The analysis of the high-dimensional linear regression model has attracted
a lot of attention in the last decade. Nevertheless, there is a longstanding gap
between the theory, where the variance $\sigma^2$ is generally assumed to be known, and
the practice, where it is often unknown. The present paper is mainly devoted
to reviewing recent results on linear regression in high-dimensional settings with
unknown variance $\sigma^2$. A few additional results for multivariate regression and
the nonparametric regression model

$Y_i = f(X^{(i)}) + \varepsilon_i, \quad i = 1, \dots, n$, (2)

will also be mentioned.
1.1 Sparsity assumptions

In a high-dimensional linear regression model, accurate estimation is infeasible
unless it relies on some special properties of the parameter $\beta_0$. The most common
assumption on $\beta_0$ is that it is sparse in some sense. We will consider in this paper
the following three classical sparsity assumptions.

Coordinate-sparsity. Most of the coordinates of $\beta_0$ are assumed to be zero (or
approximately zero). This is the most common meaning of sparsity in linear
regression.

Structured-sparsity. The pattern of zeros of the coordinates of $\beta_0$ is assumed
to have an a priori known structure. For instance, in group-sparsity [80], the
covariates are clustered into M groups and when the coefficient $\beta_{0,i}$ corresponding
to the covariate $X_i$ (the i-th column of X) is non-zero, then it is likely that all
the coefficients $\beta_{0,j}$ with variables $X_j$ in the same cluster as $X_i$ are non-zero.

Variation-sparsity. The $(p-1)$-dimensional vector $\beta_0^V$ of variation of $\beta_0$ is defined
by $\beta^V_{0,j} = \beta_{0,j+1} - \beta_{0,j}$. Sparsity in variation means that most of the components
of $\beta_0^V$ are equal to zero (or approximately zero). When p = n and $X = I_n$,
variation-sparse linear regression corresponds to signal segmentation.
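To make the difference between coordinate-sparsity and variation-sparsity concrete, here is a minimal sketch in Python (NumPy assumed). The helper name `variation` and the example vector are illustrative choices, not taken from the paper. A piecewise-constant vector can have many non-zero coordinates while its vector of variation has only a few.

```python
import numpy as np

def variation(beta):
    """Return the (p-1)-dimensional vector of successive differences
    beta[j+1] - beta[j], i.e. the vector of variation of beta."""
    return np.diff(np.asarray(beta, dtype=float))

# Hypothetical piecewise-constant signal with two jumps.
beta = [1, 1, 1, 3, 3, 0, 0, 0]
v = variation(beta)

n_nonzero_coords = int(np.count_nonzero(beta))  # coordinate-sparsity level
n_nonzero_jumps = int(np.count_nonzero(v))      # variation-sparsity level
```

In this example the vector has five non-zero coordinates but only two non-zero variations, which is the situation met in signal segmentation.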

1.2 Statistical objectives 

In the linear regression model, there are roughly two kinds of estimation objectives.
In the prediction problem, the goal is to estimate $X\beta_0$, whereas in the inverse
problem it is to estimate $\beta_0$. When the vector $\beta_0$ is sparse, a related objective is
to estimate the support of $\beta_0$ (model identification problem), which is the set of
the indices j corresponding to the non-zero coefficients $\beta_{0,j}$. Inverse problems and
prediction problems are not equivalent in general. When the Gram matrix $X^*X$
is poorly conditioned, the former problems can be much more difficult than the
latter. Since there are only a few results on inverse problems with unknown variance,
we will focus on the prediction problem, the support estimation problem
being briefly discussed in the course of the paper.

In the sequel, $\mathbb{E}_{\beta_0}[\cdot]$ stands for the expectation with respect to $Y \sim \mathcal{N}(X\beta_0, \sigma^2 I_n)$
and $\|\cdot\|_2$ is the Euclidean norm. The prediction objective amounts to building estimators
$\widehat\beta$ so that the risk

$\mathcal{R}[\widehat\beta; \beta_0] := \mathbb{E}_{\beta_0}\left[\|X(\widehat\beta - \beta_0)\|_2^2\right]$ (3)

is as small as possible.

1.3 Approaches 

Most procedures that handle high-dimensional linear models [22, 26, 62, 72,
73, 81, 83, 85] rely on tuning parameters whose optimal value depends on $\sigma$. For
example, the results of Bickel et al. [17] suggest choosing the tuning parameter $\lambda$
of the Lasso of the order of $2\sigma\sqrt{2\log(p)}$. As a consequence, none of these procedures
can be directly applied when $\sigma$ is unknown.
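The dependence on the noise level can be made explicit with a small sketch. The function name `lasso_lambda` is a hypothetical helper; it simply evaluates the order of magnitude $2\sigma\sqrt{2\log(p)}$ quoted above and is not a tuning rule endorsed beyond that statement.

```python
import math

def lasso_lambda(sigma, p):
    """Order of magnitude of the Lasso tuning parameter when the noise
    standard deviation sigma is known: 2 * sigma * sqrt(2 * log(p))."""
    return 2.0 * sigma * math.sqrt(2.0 * math.log(p))

# The value scales linearly with sigma, so a misjudged noise level is
# carried over directly to the tuning parameter.
lam_unit_noise = lasso_lambda(1.0, 1000)
lam_double_noise = lasso_lambda(2.0, 1000)
```

Since the value is proportional to $\sigma$, any error on the noise level translates into the same relative error on the tuning parameter.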

A straightforward approach is to replace $\sigma^2$ by an estimate of the variance
in the optimal value of the tuning parameter(s). Nevertheless, the variance is
difficult to estimate in high-dimensional settings, so a plug-in of the variance does
not necessarily yield good results. There are basically two approaches to build on
this body of work on high-dimensional estimation with known variance.
1. Ad-hoc estimation. There has been some recent work [16, 68, 71] to modify
procedures like the Lasso in such a way that the tuning parameter does
not depend anymore on $\sigma$ (see Section 4.2). The challenge is to find a
smart modification of the procedure, so that the resulting estimator $\widehat\beta$ is
computationally feasible and has a risk $\mathcal{R}[\widehat\beta; \beta_0]$ as small as possible.

2. Estimator selection. Given a collection $(\widehat\beta_\lambda)_{\lambda \in \Lambda}$ of estimators, the objective
of estimator selection is to pick an index $\widehat\lambda$ such that the risk of $\widehat\beta_{\widehat\lambda}$ is as
small as possible; ideally as small as the risk $\mathcal{R}[\widehat\beta_{\lambda^*}; \beta_0]$ of the so-called
oracle estimator

$\widehat\beta_{\lambda^*} := \arg\min_{\{\widehat\beta_\lambda,\ \lambda \in \Lambda\}} \mathcal{R}[\widehat\beta_\lambda; \beta_0]$. (4)

Efficient estimator selection procedures can then be applied to tune the
aforementioned estimation methods [22, 26, 62, 72, 73, 81, 83, 85]. Among
the most famous methods for estimator selection, we mention V-fold cross-validation
(Geisser [32]), AIC (Akaike [1]) and BIC (Schwarz [64]) criteria.

The objective of this survey is to describe state-of-the-art procedures for high-dimensional
linear regression with unknown variance. We will review both automatic
tuning methods and ad-hoc methods. There are some procedures that we
will leave aside. For example, Baraud [11] provides a versatile estimator selection
scheme, but the procedure is computationally intractable in large dimensions. Linear
or convex aggregation of estimators is also a valuable alternative to estimator
selection when the goal is to perform estimation, but only a few theoretical works
have addressed the aggregation problem when the variance is unknown [35, 33].
For these reasons, we will not review these approaches in the sequel.

1.4 Why care about non-asymptotic analyses?

AIC [1], BIC [64] and V-fold cross-validation [32] are probably the most popular
criteria for estimator selection. The use of these criteria relies on some classical
asymptotic optimality results. These results focus on the setting where the collection
of estimators $(\widehat\beta_\lambda)_{\lambda \in \Lambda}$ and the dimension p are fixed and consider the limit
behavior of the criteria when the sample size n goes to infinity. For example,
under some suitable conditions, Shibata [67], Li [53] and Shao [66] prove that the
risk of the estimator selected by AIC or V-fold CV (with $V = V_n \to \infty$) is asymptotically
equivalent to the oracle risk $\mathcal{R}[\widehat\beta_{\lambda^*}; \beta_0]$. Similarly, Nishii [59] shows that
the BIC criterion is consistent for model selection.

All these asymptotic results can lead to misleading conclusions in modern
statistical settings where the sample size remains small and the parameter's dimension
becomes large. For instance, it is proved in [12, Sect. 3.3.2] and illustrated
in [12, Sect. 6.2] that BIC (and thus AIC) can strongly overfit and should not be
used for p larger than n. Additional examples are provided in Appendix A. A
non-asymptotic analysis takes into account all the characteristics of the selection
problem (sample size n, parameter dimension p, number of models per dimension,
etc.). It treats n and p as they are and avoids missing important features
hidden in asymptotic limits. For these reasons, we will restrict this review to
non-asymptotic results.
1.5 Organization of the paper

In Section 2, we investigate how the ignorance of the variance affects the minimax
risk bounds. In Section 3, some "generic" estimator selection schemes are
presented. The coordinate-sparse setting is addressed in Section 4: some theoretical
results are collected and a small numerical experiment compares different Lasso-based
procedures. The group-sparse and variation-sparse settings are reviewed in
Sections 5 and 6, and Section 7 is devoted to some more general models such as
multivariate regression or nonparametric regression.

In the sequel, $C, C_1, \dots$ refer to numerical constants whose value may vary from
line to line, while $\|\beta\|_0$ stands for the number of non-zero components of $\beta$ and
$|\mathcal{J}|$ for the cardinality of a set $\mathcal{J}$.

2. THEORETICAL LIMITS 

The goal of this section is to address the intrinsic difficulty of a coordinate- 
sparse linear regression problem. We will answer the following questions: Which 
range of p can we reasonably consider? When the variance is unknown, can we 
hope to do as well as when the variance is known? 

2.1 Minimax adaptation

A classical way to assess the performance of an estimator $\widehat\beta$ is to measure its
maximal risk over a class $\mathcal{B} \subset \mathbb{R}^p$. This is the minimax point of view. As we
are interested in coordinate-sparsity for $\beta_0$, we will consider the sets $B[k,p]$ of
vectors that contain at most k non-zero coordinates for some k > 0.

Given an estimator $\widehat\beta$, the maximal prediction risk of $\widehat\beta$ over $B[k,p]$ for a fixed
design X and a variance $\sigma^2$ is defined by $\sup_{\beta_0 \in B[k,p]} \mathcal{R}[\widehat\beta; \beta_0]$, where the risk
function $\mathcal{R}[\cdot; \beta_0]$ is defined by (3). Taking the infimum of the maximal risk over
all possible estimators $\widehat\beta$, we obtain the minimax risk

$R[k, X] = \inf_{\widehat\beta} \sup_{\beta_0 \in B[k,p]} \mathcal{R}[\widehat\beta; \beta_0]$. (5)

Minimax bounds are convenient results to assess the range of problems that are
statistically feasible and the optimality of particular procedures. Below, we say
that an estimator $\widehat\beta$ is "minimax" over $B[k,p]$ if its maximal prediction risk is
close to the minimax risk.

In practice, the number of non-zero coordinates of $\beta_0$ is unknown. The fact that
an estimator $\widehat\beta$ is minimax over $B[k,p]$ for some specific k > 0 does not imply
that $\widehat\beta$ estimates well vectors $\beta_0$ that are less sparse. A good estimation procedure
$\widehat\beta$ should not require the knowledge of the sparsity k of $\beta_0$ and should perform
as well as if this sparsity were known. An estimator $\widehat\beta$ that nearly achieves the
minimax risk over $B[k,p]$ for a range of k is said to be adaptive to the sparsity.
Similarly, an estimator $\widehat\beta$ is adaptive to the variance $\sigma^2$ if it does not require
the knowledge of $\sigma^2$ and nearly achieves the minimax risk for all $\sigma^2 > 0$. When
possible, the main challenge is to build adaptive procedures.

In the following subsections, we review sharp bounds on the minimax prediction
risks for both known and unknown sparsity, known and unknown variance. The
big picture is summed up in Figure 1. Roughly, it says that adaptation is possible
as long as $2k\log(p/k) < n$. In contrast, the situation becomes more complex for
the ultra-high-dimensional$^1$ setting where $2k\log(p/k) > n$. The rest of this section
is devoted to explaining this big picture.




[Figure 1. Minimal prediction risk over $B[k,p]$ as a function of k. Horizontal axis: k. The region $2k\log(p/k) > n$ is labeled "Ultra-high dimension".]



2.2 Minimax risks under known sparsity and known variance 

The minimax risk $R[k, X]$ depends on the form of the design X. In order to
grasp this dependency, we define for any k > 0 the largest and the smallest
sparse eigenvalues of order k of $X^*X$ by

$\Phi_{k,+}(X) := \sup_{\beta \in B[k,p] \setminus \{0_p\}} \frac{\|X\beta\|_2^2}{\|\beta\|_2^2}$ and $\Phi_{k,-}(X) := \inf_{\beta \in B[k,p] \setminus \{0_p\}} \frac{\|X\beta\|_2^2}{\|\beta\|_2^2}$.

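Proposition 2.1. Assume that k and $\sigma$ are known. There exist positive numerical
constants $C_1$, $C_1'$, $C_2$, and $C_2'$ such that the following holds. For any
(k, n, p) such that $k \le n/2$ and any design X, we have

$C_1 \frac{\Phi_{2k,-}(X)}{\Phi_{2k,+}(X)} \left[k\log\left(\frac{p}{k}\right) \wedge n\right]\sigma^2 \le R[k, X] \le C_1' \left[k\log\left(\frac{p}{k}\right) \wedge n\right]\sigma^2$. (6)

For any (k, n, p) such that $k \le n/2$, we have

$C_2 \left[k\log\left(\frac{p}{k}\right) \wedge n\right]\sigma^2 \le \sup_X R[k, X] \le C_2' \left[k\log\left(\frac{p}{k}\right) \wedge n\right]\sigma^2$. (7)
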

The minimax lower bound (6) was first proved in [61, 62, 79], while (7) is
stated in [77]. Let us first comment on the bound (7). If the vector $\beta_0$ has k non-zero
components and if these components are a priori known, then one may build estimators
that achieve a risk bound of the order $k\sigma^2$. In a (non-ultra) high-dimensional



$^1$ In some papers, the expression ultra-high-dimensional has been used to characterize problems
such that $\log(p) = O(n^\theta)$ with $\theta < 1$. We argue here that as soon as $k\log(p)/n$ goes to
0, the case $\log(p) = O(n^\theta)$ is not intrinsically more difficult than conditions such as $p = O(n^\delta)$
with $\delta > 0$.






GIRAUD, HUET AND VERZELEN 



setting ($2k\log(p/k) < n$), the minimax risk is of the order $k\log(p/k)\sigma^2$. The logarithmic
term is the price to pay to cope with the fact that we do not know
the position of the non-zero components in $\beta_0$. The situation is quite different
in an ultra-high-dimensional setting ($2k\log(p/k) > n$). Indeed, the minimax risk
remains of the order of $n\sigma^2$, which corresponds to the minimax risk of estimation
of the vector $X\beta_0$ without any sparsity assumption (see the blue curve in Figure
1). In other terms, the sparsity index k does not play a role anymore.

Dependency of $R[k, X]$ on the design X. It follows from (6) that $\sup_X R[k, X]$
is nearly achieved by designs X satisfying $\Phi_{2k,-}(X)/\Phi_{2k,+}(X) \approx 1$, when the setting
is not ultra-high-dimensional. For some designs such that $\Phi_{2k,-}(X)/\Phi_{2k,+}(X)$
is small, the minimax prediction risk $R[k, X]$ is possibly smaller (see [77] for a discussion).
In an ultra-high-dimensional setting, the form of the minimax risk ($n\sigma^2$) is related
to the fact that no design can satisfy $\Phi_{2k,-}(X)/\Phi_{2k,+}(X) \approx 1$ (see e.g. [10]).
The lower bound $R[k, X] \ge C[k\log(p/k) \wedge n]\sigma^2$ in (7) is for instance achieved
by realizations of a standard Gaussian design, that is, designs X whose components
follow independent standard normal distributions. See [77] for more details.

2.3 Adaptation to the sparsity and to the variance 

Adaptation to the sparsity when the variance is known. When $\sigma^2$ is
known, there exist both model selection and aggregation procedures that achieve
this $[k\log(p/k) \wedge n]\sigma^2$ risk simultaneously for all k and for all designs X. Such
procedures derive from the work of Birgé and Massart [18] and Leung and Barron [52].
However, these methods are intractable for large p except for specific
forms of the design. We refer to Appendix B.1 for more details.

Simultaneous adaptation to the sparsity and the variance. We first restrict
to the non-ultra-high-dimensional setting, where the number of non-zero
components k is unknown but satisfies $2k\log(p/k) < n$. In this setting, some procedures
based on penalized log-likelihood [12] are simultaneously adaptive to the
unknown sparsity and to the unknown variance, and this for all designs X. Again,
such procedures are intractable for large p. See Appendix B.2 for more details. If
we want to cover all k (including ultra-high-dimensional settings), the situation
is different, as shown in the next proposition (from [77]).

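Proposition 2.2 (Simultaneous adaptation is impossible). There exist positive
constants $C$, $C'$, $C_1$, $C_2$, $C_3$, $C_1'$, $C_2'$, and $C_3'$ such that the following holds.
Consider any $p \ge n \ge C$ and $k \le p^{1/3} \wedge n/2$ such that $k\log(p/k) \ge C'n$. There
exist designs X of size $n \times p$ such that for any estimator $\widehat\beta$, we have either

$\sup_{\sigma^2 > 0} \frac{\mathcal{R}[\widehat\beta; 0_p]}{\sigma^2} > C_1 n$,

or

$\sup_{\beta_0 \in B[k,p],\ \sigma^2 > 0} \frac{\mathcal{R}[\widehat\beta; \beta_0]}{\sigma^2} > C_2 k\log\left(\frac{p}{k}\right) \exp\left[C_3 \frac{k}{n}\log\left(\frac{p}{k}\right)\right]$.

Conversely, there exist two estimators $\widehat\beta^{(n)}$ and $\widehat\beta^{BGH}$ (defined in Appendix B.2)
that respectively satisfy

$\sup_X \sup_{\beta_0 \in \mathbb{R}^p,\ \sigma^2 > 0} \frac{\mathcal{R}[\widehat\beta^{(n)}; \beta_0]}{\sigma^2} \le C_1' n$,

$\sup_X \sup_{\beta_0 \in B[k,p],\ \sigma^2 > 0} \frac{\mathcal{R}[\widehat\beta^{BGH}; \beta_0]}{\sigma^2} \le C_2' k\log\left(\frac{p}{k}\right) \exp\left[C_3' \frac{k}{n}\log\left(\frac{p}{k}\right)\right]$,

for all $1 \le k \le [(n-1) \wedge p]/4$.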

As a consequence, simultaneous adaptation to the sparsity and to the variance
is impossible in an ultra-high-dimensional setting. Indeed, any estimator $\widehat\beta$ that
does not rely on $\sigma^2$ has to pay at least one of these two prices:

1. The estimator $\widehat\beta$ does not use the sparsity of the true parameter $\beta_0$ and its
risk for estimating $X0_p$ is of the same order as the minimax risk over $\mathbb{R}^p$.

2. For any $1 \le k \le p^{1/3}$, the risk of $\widehat\beta$ fulfills

$\sup_{\sigma > 0} \sup_{\beta_0 \in B[k,p]} \frac{\mathcal{R}[\widehat\beta; \beta_0]}{\sigma^2} \ge C_1 k\log(p) \exp\left[C_2 \frac{k}{n}\log(p)\right]$.

It follows that the maximal risk of $\widehat\beta$ blows up in an ultra-high-dimensional
setting (red curve in Figure 1), while the minimax risk stays of order $n\sigma^2$ (blue
curve in Figure 1). The designs that satisfy the minimax lower bounds of
Proposition 2.2 include realizations of a standard Gaussian design.

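In an ultra-high-dimensional setting, the prediction problem becomes extremely
difficult under unknown variance because the variance estimation itself is inconsistent,
as shown in the next proposition (from [77]).

Proposition 2.3. There exist positive constants $C$, $C_1$, and $C_2$ such that
the following holds. Assume that $p \ge n \ge C$. For any $1 \le k \le p^{1/3}$, there exist
designs X such that

$\inf_{\widehat\sigma} \sup_{\sigma > 0,\ \beta_0 \in B[k,p]} \mathbb{E}_{\beta_0}\left[\left|\frac{\widehat\sigma^2}{\sigma^2} - \frac{\sigma^2}{\widehat\sigma^2}\right|\right] \ge C_1 \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left[C_2 \frac{k}{n}\log\left(\frac{p}{k}\right)\right]$.
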



2.4 What should we expect from a good estimation procedure? 

Let us consider an estimator $\widehat\beta$ that does not depend on $\sigma^2$. Relying on the
previous minimax bounds, we will say that $\widehat\beta$ achieves an optimal risk bound
(with respect to the sparsity) if

$\mathcal{R}[\widehat\beta; \beta_0] \le C_1 \|\beta_0\|_0 \log(p)\sigma^2$, (8)

for any $\sigma > 0$ and any vector $\beta_0 \in \mathbb{R}^p$ such that $1 \le \|\beta_0\|_0 \log(p) \le C_2 n$. Such risk
bounds prove that the estimator is approximately (up to a possible $\log(\|\beta_0\|_0)$
additional term) minimax adaptive to the unknown variance and the unknown
sparsity. The condition $\|\beta_0\|_0 \log(p) \le C_2 n$ ensures that the setting is not
ultra-high-dimensional. As stated above, some procedures achieve (8) for all designs X
but they are intractable for large p (see Appendix B). One purpose of this review
is to present fast procedures that achieve this kind of bound under possibly
restrictive assumptions on the design matrix X.

For some procedures, (8) can be improved into a bound of the form

$\mathcal{R}[\widehat\beta; \beta_0] \le C_1 \inf_{\beta} \left\{\|X(\beta - \beta_0)\|_2^2 + \|\beta\|_0 \log(p)\sigma^2\right\}$, (9)

with $C_1$ close to one. Again, the dimension $\|\beta_0\|_0$ is restricted to be smaller than
$Cn/\log(p)$ to ensure that the setting is not ultra-high-dimensional. This kind of
bound makes a clear trade-off between a bias and a variance term. For instance,
when $\beta_0$ contains many components that are nearly equal to zero, the bound (9)
can be much smaller than (8).

2.5 Other statistical problems in an ultra-high-dimensional setting 

We have seen that adaptation becomes impossible for the prediction problem
in an ultra-high-dimensional setting. For other statistical problems, including the
prediction problem with random design, the inverse problem (estimation of $\beta_0$),
the variable selection problem (estimation of the support of $\beta_0$), and the dimension
reduction problem [77, 78, 46], the minimax risks blow up in an ultra-high-dimensional
setting. This kind of phase transition has been observed in a wide
range of random geometry problems [29], suggesting some universality in this
limitation. In practice, the sparsity index k is not known, but given (n, p) we can
compute $k^* := \max\{k : 2k\log(p/k) \le n\}$. One may consider that the problem
is still of reasonable difficulty as long as $k \le k^*$. This gives a simple rule of thumb
to know what we can hope from a given regression problem. For example, setting
p = 5000 and n = 50 leads to $k^* = 3$, implying that the prediction problem
becomes extremely difficult when there are more than 4 relevant covariates (see
the simulations in [77]).
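The rule of thumb can be evaluated numerically. The sketch below assumes that $k^*$ is read as the largest k such that $2j\log(p/j) \le n$ for every $j \le k$, since the map $k \mapsto 2k\log(p/k)$ is increasing only for small k; the function name `k_star` is a hypothetical helper.

```python
import math

def k_star(n, p):
    """Largest k such that 2*j*log(p/j) <= n for all 1 <= j <= k.

    Scans k upward and stops at the first violation, so that only the
    increasing range of k -> 2*k*log(p/k) is considered.
    """
    k = 0
    while k + 1 <= p and 2 * (k + 1) * math.log(p / (k + 1)) <= n:
        k += 1
    return k

# Example from the text: p = 5000 covariates and n = 50 observations.
k_limit = k_star(50, 5000)
```

With p = 5000 and n = 50, the quantity $2k\log(p/k)$ is about 44.5 for k = 3 and about 57.0 for k = 4, so the scan stops at 3, in agreement with the value quoted above.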

3. SOME GENERIC SELECTION SCHEMES 

Among the selection schemes not requiring the knowledge of the variance $\sigma^2$,
some are very specific to a particular algorithm, while some others are more
generic. We describe in this section three versatile selection principles and refer
to the examples for the more specific schemes.

3.1 Cross-validation procedures

The cross-validation schemes are nearly universal in the sense that they can be
implemented in most statistical frameworks and for most estimation procedures.
The principle of the cross-validation schemes is to split the data into a training set
and a validation set: the estimators are built on the training set and the validation
set is used for estimating their prediction risk. This training/validation splitting
is possibly repeated several times. The most popular cross-validation schemes
are:

• Hold-out [57, 27], which is based on a single split of the data for training
and validation.

• V-fold CV [32]. The data is split into V subsamples. Each subsample is successively
removed for validation, the remaining data being used for training.

• Leave-one-out [69], which corresponds to n-fold CV.

• Leave-q-out (also called delete-q-CV) [65], where every possible subset of
cardinality q of the data is removed for validation, the remaining data being
used for training.

We refer to Arlot and Celisse [6] for a review of the cross-validation schemes and
their theoretical properties.
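The V-fold scheme listed above can be sketched as follows in Python (NumPy assumed). This is a minimal illustration under stated assumptions, not the implementation used in the paper: the names `vfold_cv_select` and `ols_first_d` are hypothetical, and the candidate estimators are least-squares fits on the first d columns, chosen only to keep the example self-contained.

```python
import numpy as np

def vfold_cv_select(X, y, candidates, fit, V=5, seed=0):
    """Select the candidate with the smallest V-fold CV prediction error.

    fit(X_train, y_train, lam) must return a coefficient vector of length p.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), V)  # V disjoint subsamples
    cv_err = np.zeros(len(candidates))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        for i, lam in enumerate(candidates):
            beta = fit(X[train], y[train], lam)
            # Squared prediction error on the subsample removed for validation.
            cv_err[i] += np.sum((y[held_out] - X[held_out] @ beta) ** 2)
    return candidates[int(np.argmin(cv_err))], cv_err / n

def ols_first_d(X, y, d):
    """Illustrative estimator: least squares on the first d columns only."""
    beta = np.zeros(X.shape[1])
    beta[:d] = np.linalg.lstsq(X[:, :d], y, rcond=None)[0]
    return beta

# Simulated data (assumption): 3 relevant covariates out of 10.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
beta0 = np.zeros(10)
beta0[:3] = 5.0
y = X @ beta0 + rng.standard_normal(200)
best_d, errors = vfold_cv_select(X, y, list(range(1, 11)), ols_first_d, V=5)
```

Any other family of estimators, such as a Lasso path indexed by its tuning parameter, can be plugged in through the `fit` argument.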

3.2 Penalized empirical loss 

Penalized empirical loss criteria form another class of versatile selection schemes,
yet less universal than CV procedures. The principle is to select among a family
$(\widehat\beta_\lambda)_{\lambda \in \Lambda}$ of estimators by minimizing a criterion of the generic form

$\mathrm{Crit}(\lambda) = \mathcal{L}_X(Y, \widehat\beta_\lambda) + \mathrm{pen}(\lambda)$, (10)

where $\mathcal{L}_X(Y, \widehat\beta_\lambda)$ is a measure of the distance between Y and $X\widehat\beta_\lambda$, and pen is a
function from $\Lambda$ to $\mathbb{R}^+$. The penalty function sometimes depends on the data.

Penalized log-likelihood. The most famous criteria of the form (10) are AIC
and BIC. They have been designed to select among estimators $\widehat\beta_\lambda$ obtained by
maximizing the likelihood of $(\beta, \sigma)$ with the constraint that $\beta$ lies in a linear
space $S_\lambda$ (called model). In the Gaussian case, these estimators are given by
$X\widehat\beta_\lambda = \Pi_{S_\lambda} Y$, where $\Pi_{S_\lambda}$ denotes the orthogonal projector onto the model $S_\lambda$.
For AIC and BIC, the function $\mathcal{L}_X$ corresponds to twice the negative log-likelihood
$\mathcal{L}_X(Y, \widehat\beta_\lambda) = n\log(\|Y - X\widehat\beta_\lambda\|_2^2)$ and the penalties are $\mathrm{pen}(\lambda) = 2\dim(S_\lambda)$ and
$\mathrm{pen}(\lambda) = \dim(S_\lambda)\log(n)$ respectively. We recall that these two criteria can perform
very poorly in a high-dimensional setting.
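The two criteria can be computed directly from the formulas above. The following sketch (NumPy assumed) takes a model to be the column span of a matrix `X_model`; the function name `aic_bic` and the toy data are illustrative assumptions.

```python
import numpy as np

def aic_bic(X_model, y):
    """AIC and BIC in the Gaussian case with unknown variance.

    Fit term: n * log(||y - Pi_S y||_2^2), with Pi_S the orthogonal projector
    onto S = range(X_model). Penalties: 2*dim(S) and dim(S)*log(n).
    """
    n = len(y)
    coef, *_ = np.linalg.lstsq(X_model, y, rcond=None)
    rss = float(np.sum((y - X_model @ coef) ** 2))
    dim = int(np.linalg.matrix_rank(X_model))
    fit_term = n * np.log(rss)
    return fit_term + 2 * dim, fit_term + dim * np.log(n)

# Toy example (assumption): intercept-only model on four observations.
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
aic_value, bic_value = aic_bic(np.ones((4, 1)), y_toy)
```

Selecting a model then amounts to evaluating the chosen criterion on every candidate model and keeping the minimizer.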

In the same setting, Baraud et al. [12] propose alternative penalties built
from a non-asymptotic perspective. The resulting criterion can handle the high-dimensional
setting where p is possibly larger than n, and the risk of the selection
procedure is controlled by a bound of the form (9), see Theorem 2 in [12].

Plug-in criteria. Many other penalized-empirical-loss criteria have been developed
in the last decades. Several selection criteria [14, 18] have been designed from a
non-asymptotic point of view to handle the case where the variance is known.
These criteria usually involve the residual least-square $\mathcal{L}_X(Y, \widehat\beta_\lambda) = \|Y - X\widehat\beta_\lambda\|_2^2$
and a penalty $\mathrm{pen}(\lambda)$ depending on the variance $\sigma^2$. A common practice is then
to plug into the penalty an estimate $\widehat\sigma^2$ of the variance in place of the variance. For
linear regression, when the design matrix X has a rank less than n, a classical
choice for $\widehat\sigma^2$ is

$\widehat\sigma^2 = \frac{\|Y - \Pi_X Y\|_2^2}{n - \mathrm{rank}(X)}$,

with $\Pi_X$ the orthogonal projector onto the range of X. This estimator $\widehat\sigma^2$ has the
nice feature of being independent of $\Pi_X Y$, on which the estimators $\widehat\beta_\lambda$ usually rely.
Nevertheless, the variance of $\widehat\sigma^2$ is of order $\sigma^4/(n - \mathrm{rank}(X))$, which is small only
when the sample size n is quite large compared with the rank of X. This situation is
unfortunately not likely to happen in a high-dimensional setting where p can be
larger than n.
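The classical residual variance estimator can be written in a few lines (NumPy assumed). The function name `residual_variance` is a hypothetical helper; the sketch also makes explicit that the estimator is undefined as soon as rank(X) = n, which is the typical situation when p is larger than n.

```python
import numpy as np

def residual_variance(X, y):
    """Return ||y - Pi_X y||_2^2 / (n - rank(X)), where Pi_X is the
    orthogonal projector onto the range of X."""
    n = X.shape[0]
    rank = int(np.linalg.matrix_rank(X))
    if rank >= n:
        # No residual degrees of freedom left: the estimator is undefined.
        raise ValueError("rank(X) must be smaller than n")
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return rss / (n - rank)

# Toy example (assumption): intercept-only design on four observations.
sigma2_hat = residual_variance(np.ones((4, 1)), np.array([1.0, 2.0, 3.0, 4.0]))
```

In the toy example the residual sum of squares is 5 and there are 3 residual degrees of freedom.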

3.3 Approximation versus complexity penalization : LinSelect 

The criterion proposed by Baraud et al. [12] can handle high-dimensional settings
but it suffers from two rigidities. First, it can only handle fixed collections
of models $(S_\lambda)_{\lambda \in \Lambda}$. In some situations, the size of $\Lambda$ is huge (e.g. for complete
variable selection) and the estimation procedure can then be computationally
intractable. In this case, we may want to work with a subcollection of models
$(S_\lambda)_{\lambda \in \widehat\Lambda}$, where $\widehat\Lambda \subset \Lambda$ may depend on the data. For example, for complete variable
selection, the subset $\widehat\Lambda$ could be generated by efficient algorithms like LARS [30].
The second rigidity of the procedure of Baraud et al. [12] is that it can only handle
constrained-maximum-likelihood estimators. This procedure then does not help
for selecting among arbitrary estimators such as the Lasso or Elastic-Net.

These two rigidities have been addressed recently by Baraud et al. [13]. They
propose a selection procedure, LinSelect, which can handle both data-dependent
collections of models and arbitrary estimators $\widehat\beta_\lambda$. The procedure is based on a
collection $\mathbb{S}$ of linear spaces which gives a collection of possible "approximative"
supports for the estimators $(X\widehat\beta_\lambda)_{\lambda \in \Lambda}$. A measure of complexity on $\mathbb{S}$ is provided
by a weight function $\Delta : \mathbb{S} \to \mathbb{R}^+$. We refer to Sections 4.1 and 5 for examples of
collection $\mathbb{S}$ and weight $\Delta$ in the context of coordinate-sparse and group-sparse
regression. We present below a simplified version of the LinSelect procedure. For
a suitable, possibly data-dependent, subset $\widehat{\mathbb{S}} \subset \mathbb{S}$ (depending on the statistical
problem), the estimator $\widehat\beta_{\widehat\lambda}$ is selected by minimizing the criterion

$\mathrm{Crit}(\widehat\beta_\lambda) = \inf_{S \in \widehat{\mathbb{S}}} \left[\|Y - \Pi_S X\widehat\beta_\lambda\|_2^2 + \frac{1}{2}\|X\widehat\beta_\lambda - \Pi_S X\widehat\beta_\lambda\|_2^2 + \mathrm{pen}(S)\,\widehat\sigma_S^2\right]$, (11)

where $\Pi_S$ is the orthogonal projector onto S,

$\widehat\sigma_S^2 = \frac{\|Y - \Pi_S Y\|_2^2}{n - \dim(S)}$,


and pen(S) is a penalty depending on $\Delta$. In the cases we will consider here, the
penalty pen(S) is roughly of the order of $\Delta(S)$ and therefore it penalizes S according
to its complexity. We refer to Appendix C for a precise definition of
this penalty and more details on its characteristics. We emphasize that the Criterion
(11) and the family of estimators $\{\widehat\beta_\lambda, \lambda \in \Lambda\}$ are based on the same data Y
and X. In other words, there is no data-splitting occurring in the LinSelect procedure.
The first term in (11) quantifies the fit of the projected estimator to the
data, the second term evaluates the approximation quality of the space S and the
last term penalizes S according to its complexity. We refer to Proposition C.1
in Appendix C and Theorem 1 in [12] for risk bounds on the selected estimator.
Instantiations of the procedure and more specific risk bounds are given in
Sections 4 and 5 in the context of coordinate-sparsity and group-sparsity.

From a computational point of view, the algorithmic complexity of LinSelect
is at most proportional to $|\Lambda| \times |\widehat{\mathbb{S}}|$ and in many cases there is no need to scan
the whole set $\widehat{\mathbb{S}}$ for each $\lambda \in \Lambda$ to minimize (11). In the examples of Sections 4
and 5, the whole procedure is computationally less intensive than V-fold CV,
see Table 3. Finally, we mention that for the constrained least-square estimators
$X\widehat\beta_\lambda = \Pi_{S_\lambda} Y$, the LinSelect procedure with $\widehat{\mathbb{S}} = \{S_\lambda : \lambda \in \Lambda\}$ simply coincides
with the procedure of Baraud et al. [12].
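The structure of the criterion can be illustrated with a short sketch (NumPy assumed). It assumes that the criterion is the infimum over the candidate spaces S of a fit term $\|Y - \Pi_S X\widehat\beta_\lambda\|_2^2$, half of the approximation term $\|X\widehat\beta_\lambda - \Pi_S X\widehat\beta_\lambda\|_2^2$, and $\mathrm{pen}(S)$ times the residual variance estimate on S. The penalty of Appendix C is not reproduced here: the argument `pen` is a user-supplied placeholder, and all function names are hypothetical.

```python
import numpy as np

def project(basis, v):
    """Orthogonal projection of v onto the column span of basis."""
    coef, *_ = np.linalg.lstsq(basis, v, rcond=None)
    return basis @ coef

def linselect_crit(y, fitted, bases, pen):
    """Criterion value for one estimator, given its fitted vector X @ beta_hat.

    bases: list of matrices whose column spans are the candidate spaces S
           (each must have dimension smaller than n).
    pen:   placeholder callable mapping dim(S) to a penalty value.
    """
    n = len(y)
    best = np.inf
    for basis in bases:
        dim = int(np.linalg.matrix_rank(basis))
        proj_fit = project(basis, fitted)
        sigma2_s = np.sum((y - project(basis, y)) ** 2) / (n - dim)
        value = (np.sum((y - proj_fit) ** 2)                # fit of the projected estimator
                 + 0.5 * np.sum((fitted - proj_fit) ** 2)   # approximation quality of S
                 + pen(dim) * sigma2_s)                     # complexity penalty
        best = min(best, value)
    return float(best)

# Toy check (assumption): one candidate space spanned by the constant vector.
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
crit_value = linselect_crit(y_toy, y_toy.copy(), [np.ones((4, 1))], lambda d: 2.0 * d)
```

The selected estimator is the one whose fitted vector gives the smallest criterion value; no data splitting is involved, since the same y is used for the fits and for the criterion.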

4. COORDINATE-SPARSITY 

In this section, we focus on the high-dimensional linear regression model
$Y = X\beta_0 + \varepsilon$ where the vector $\beta_0$ itself is assumed to be sparse. This setting has
attracted a lot of attention in the last decade, and many estimation procedures
have been developed. Most of them require the choice of tuning parameters which
depend on the unknown variance $\sigma^2$. This is for instance the case for the Lasso [72,
24], Dantzig Selector [22], Elastic Net [85], MC+ [81], aggregation techniques [21,
26], etc.

We first discuss how the generic schemes introduced in the previous section can
be instantiated for tuning these procedures and for selecting among them. Then,
we pay special attention to the calibration of the Lasso. Finally, we discuss the
problem of support estimation and present a small numerical study.

4.1 Automatic tuning methods 

Cross-validation. Arguably, V-fold cross-validation is the most popular technique
for tuning the above-mentioned procedures. To our knowledge, there are
no other theoretical results for V-fold CV in large dimensional settings.

In practice, V-fold CV seems to give rather good results. The problem of choosing
the best V has not yet been solved [6, Section 10], but it is often reported
that a good choice for V is between 5 and 10. Indeed, the statistical performance
does not increase for larger values of V, and averaging over 10 splits remains
computationally feasible [41, Section 7.10].

LinSelect. The procedure LinSelect can be used for selecting among a collection
$(\widehat\beta_\lambda)_{\lambda \in \Lambda}$ of sparse regressors as follows. For $J \subset \{1, \dots, p\}$, we define $X_J$ as the
matrix $[X_{ij}]_{i = 1, \dots, n,\ j \in J}$ obtained by only keeping the columns of X with index in
J. We recall that the collection $\mathbb{S}$ gives some possible "approximative" supports
for the estimators $(X\widehat\beta_\lambda)_{\lambda \in \Lambda}$. For sparse linear regression, a possible collection $\mathbb{S}$
and measure of complexity $\Delta$ are

$\mathbb{S} = \{S = \mathrm{range}(X_J),\ J \subset \{1, \dots, p\},\ 1 \le |J| \le n/(3\log p)\}$

and $\Delta(S) = \log\binom{p}{\dim(S)} + \log(\dim(S))$.
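As a small numerical illustration, the sketch below evaluates a complexity weight of the form $\log\binom{p}{\dim(S)} + \log(\dim(S))$ and the largest support size allowed by a constraint of the form $|J| \le n/(3\log p)$. Both forms are read from the garbled display above and should be treated as assumptions; the function names are hypothetical.

```python
import math

def delta(p, dim_s):
    """Complexity weight log(C(p, dim_s)) + log(dim_s) of a space of dimension dim_s."""
    return math.log(math.comb(p, dim_s)) + math.log(dim_s)

def max_support_size(n, p):
    """Largest |J| satisfying |J| <= n / (3 * log(p))."""
    return math.floor(n / (3 * math.log(p)))

weight_small = delta(4, 2)                 # log(6) + log(2)
largest_j = max_support_size(50, 5000)     # 50 / (3 * log(5000)) is about 1.96
```

The weight grows with the number of supports of a given size, so larger candidate supports are penalized more heavily.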



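Let us introduce the spaces $\widehat S_\lambda = \mathrm{range}(X_{\mathrm{supp}(\widehat\beta_\lambda)})$ and the subcollection of $\mathbb{S}$

$\widehat{\mathbb{S}} = \{\widehat S_\lambda,\ \lambda \in \widehat\Lambda\}$, where $\widehat\Lambda = \{\lambda \in \Lambda : \widehat S_\lambda \in \mathbb{S}\}$.

The following proposition gives a risk bound when selecting $\widehat\lambda$ with LinSelect with
the above choice of $\widehat{\mathbb{S}}$ and $\Delta$.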

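Proposition 4.1. There exists a numerical constant C > 1 such that for any
minimizer $\widehat\lambda$ of the Criterion (11), we have

$\mathcal{R}[\widehat\beta_{\widehat\lambda}; \beta_0] \le C\,\mathbb{E}\left[\inf_{\lambda \in \widehat\Lambda}\left\{\|X\widehat\beta_\lambda - X\beta_0\|_2^2 + \inf_{S \in \widehat{\mathbb{S}}}\left\{\|X\widehat\beta_\lambda - \Pi_S X\widehat\beta_\lambda\|_2^2 + \dim(S)\log(p)\sigma^2\right\}\right\}\right]$
$\le C\,\mathbb{E}\left[\inf_{\lambda \in \widehat\Lambda}\left\{\|X\widehat\beta_\lambda - X\beta_0\|_2^2 + \|\widehat\beta_\lambda\|_0\log(p)\sigma^2\right\}\right]$. (12)
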
Proposition 4.1 is a simple corollary of Proposition C.1 in Appendix C. The
first bound involves three terms: the loss of the estimator $\widehat\beta_\lambda$, an approximation
loss, and a variance term. Hence, LinSelect chooses an estimator $\widehat\beta_{\widehat\lambda}$ that achieves
a trade-off between the loss of $\widehat\beta_\lambda$ and the closeness of $X\widehat\beta_\lambda$ to some small-dimensional
subspace S. The bound (12) cannot be formulated in the form (9)
due to the random nature of the set $\widehat\Lambda$. Nevertheless, a bound similar to (8) can
be deduced from (12) when the estimators $\widehat\beta_\lambda$ are least-squares estimators, see
Corollary 4 in [13]. Furthermore, we note that increasing the size of $\Lambda$ leads to
a better risk bound for $\widehat\beta_{\widehat\lambda}$. It is then advisable to consider a family of candidate
estimators $\{\widehat\beta_\lambda, \lambda \in \Lambda\}$ as large as possible. Proposition 4.1 is valid for
any family of estimators $\{\widehat\beta_\lambda, \lambda \in \Lambda\}$; for the specific family of Lasso estimators
$\{\widehat\beta^L_\lambda, \lambda > 0\}$ we provide a refined bound in Proposition 4.3, Section 4.3.

4.2 Lasso-type estimation under unknown variance 



The Lasso is certainly one of the most popular methods for variable selection
in a high-dimensional setting. Given $\lambda > 0$, the Lasso estimator $\widehat\beta^L_\lambda$ is defined by
$\widehat\beta^L_\lambda := \arg\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$. A sensible choice of $\lambda$ must be homogeneous
with the square root of the variance $\sigma^2$. As explained above, when the variance $\sigma^2$
is unknown, one may apply V-fold CV or LinSelect to select $\lambda$. Some alternative
approaches have also been developed for tuning the Lasso. Their common idea
is to modify the $\ell_1$ criterion so that the tuning parameter becomes pivotal with
respect to $\sigma^2$. This means that the method remains valid for any $\sigma > 0$ and
that the choice of the tuning parameter does not depend on $\sigma$. For the sake
of simplicity, we assume throughout this subsection and the next one that the
columns of X are normalized to one.

$\ell_1$-penalized log-likelihood. In low-dimensional regression, it is classical to consider a penalized log-likelihood criterion instead of a penalized least-squares criterion to handle the unknown variance. Following this principle, Städler et al. [68] propose to minimize the $\ell_1$-penalized log-likelihood criterion

$$\hat\beta^{LL}_\lambda, \hat\sigma^{LL}_\lambda := \arg\min_{\beta\in\mathbb{R}^p,\ \sigma'>0}\left[ n\log(\sigma') + \frac{\|Y - X\beta\|_2^2}{2\sigma'^2} + \lambda\frac{\|\beta\|_1}{\sigma'}\right]. \qquad (13)$$

By reparametrizing $(\beta,\sigma)$, Städler et al. [68] obtain a convex criterion that can be efficiently minimized. Interestingly, the penalty level $\lambda$ is pivotal with respect to $\sigma$. Under suitable conditions on the design matrix $X$, Sun and Zhang [70] show that the choice $\lambda = c\sqrt{2\log p}$, with $c > 1$, yields optimal risk bounds in the sense of (8).
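
To make the reparametrization step concrete, here is a short worked derivation. It is a sketch that assumes the $\ell_1$-penalized log-likelihood criterion has the form $n\log(\sigma') + \|Y-X\beta\|_2^2/(2\sigma'^2) + \lambda\|\beta\|_1/\sigma'$; the symbols $\phi$ and $\rho$ are introduced here only for illustration.

```latex
% Change of variables: \phi = \beta / \sigma', \rho = 1 / \sigma' (with \rho > 0).
\begin{align*}
n\log(\sigma') + \frac{\|Y - X\beta\|_2^2}{2\sigma'^2} + \lambda\frac{\|\beta\|_1}{\sigma'}
  &= -n\log(\rho) + \frac{1}{2}\,\|\rho Y - X\phi\|_2^2 + \lambda\|\phi\|_1 .
\end{align*}
% -n\log(\rho) is convex on \rho > 0,
% (\phi,\rho) \mapsto \|\rho Y - X\phi\|_2^2 is a convex quadratic form,
% and \|\phi\|_1 is a norm, so the sum is jointly convex in (\phi,\rho).
% Scale invariance: replacing Y by cY (c > 0) maps the minimizer
% (\phi,\rho) to (\phi, \rho/c), i.e. (\beta,\sigma') to (c\beta, c\sigma'),
% with the same \lambda.
```

The last remark is what pivotality means in practice: rescaling the data rescales the fitted coefficients and the fitted standard deviation, while the penalty level $\lambda$ is left unchanged.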

Square-root Lasso and scaled Lasso. Sun and Zhang [71], following an idea of Antoniadis [3], propose to minimize a penalized Huber's loss [44, page 179]

$$\hat\beta^{SL}_\lambda, \hat\sigma^{SL}_\lambda := \arg\min_{\beta\in\mathbb{R}^p,\ \sigma'>0}\left[\frac{n\sigma'}{2} + \frac{\|Y - X\beta\|_2^2}{2\sigma'} + \lambda\|\beta\|_1\right]. \qquad (14)$$

This convex criterion can be minimized with roughly the same computational complexity as a Lars-Lasso path [30]. Interestingly, their procedure (called the scaled Lasso in [71]) is equivalent to the square-root Lasso estimator previously introduced by Belloni et al. [16]. The square-root Lasso of Belloni et al. is obtained by replacing the residual sum of squares in the Lasso criterion by its square root

$$\hat\beta^{SR}_\lambda = \arg\min_{\beta\in\mathbb{R}^p}\left[\sqrt{\|Y - X\beta\|_2^2} + \frac{\lambda}{\sqrt{n}}\|\beta\|_1\right]. \qquad (15)$$

The equivalence between the two definitions follows from the minimization of the criterion in (14) with respect to $\sigma'$. In (14) and (15), the penalty level $\lambda$ is again pivotal with respect to $\sigma$. Sun and Zhang [71] state sharp oracle inequalities for the estimator $\hat\beta^{SR}_\lambda$ with $\lambda = c\sqrt{2\log(p)}$, with $c > 1$ (see Proposition 4.2 below). Their empirical results suggest that the criterion (15) provides slightly better results than the $\ell_1$-penalized log-likelihood. In the sequel, we shall refer to $\hat\beta^{SR}_\lambda$ as the square-root Lasso estimator.
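
The joint minimization over $(\beta, \sigma')$ can be illustrated by a simple alternating scheme. The sketch below assumes the scaled-Lasso criterion is $n\sigma'/2 + \|Y-X\beta\|_2^2/(2\sigma') + \lambda\|\beta\|_1$ and that the columns of $X$ have unit Euclidean norm. For fixed $\sigma'$ the problem in $\beta$ is a Lasso with threshold $\lambda\sigma'$, and for fixed $\beta$ the minimizer is $\sigma' = \|Y-X\beta\|_2/\sqrt{n}$. The function names are hypothetical and this plain coordinate descent is not the path algorithm of [71].

```python
import math

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, thresh, beta=None, n_sweeps=200, tol=1e-12):
    """Coordinate descent for 0.5 * ||y - X b||_2^2 + thresh * ||b||_1.
    Assumption: every column of X has unit Euclidean norm."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    for _ in range(n_sweeps):
        max_delta = 0.0
        for j in range(p):
            # Correlation of column j with the partial residual (beta_j added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) + beta[j]
            delta = soft_threshold(rho, thresh) - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] += delta
                max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    return beta, resid

def scaled_lasso(X, y, lam, n_outer=50, tol=1e-10):
    """Alternate between the Lasso step in beta and the closed-form step in sigma."""
    n, p = len(y), len(X[0])
    sigma = math.sqrt(sum(v * v for v in y) / n)  # start from the null model
    beta = [0.0] * p
    if sigma == 0.0:
        return beta, sigma
    for _ in range(n_outer):
        beta, resid = lasso_cd(X, y, lam * sigma, beta)
        new_sigma = math.sqrt(sum(r * r for r in resid) / n)
        converged = abs(new_sigma - sigma) <= tol
        sigma = new_sigma
        if converged or sigma == 0.0:
            break
    return beta, sigma
```

Because the threshold passed to the inner Lasso is $\lambda\sigma'$, multiplying $Y$ by a constant multiplies both outputs by the same constant for a fixed $\lambda$, which is the pivotal property discussed above.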

Bayesian Lasso. The Bayesian paradigm allows one to put prior distributions on the variance and the tuning parameter $\lambda$, as in the Bayesian Lasso [60]. Bayesian procedures straightforwardly handle the case of unknown variance, but no frequentist analysis of these procedures is available so far.

4.3 Risk bounds for square-root Lasso and Lasso-LinSelect 

Let us state a bound on the prediction error for the square-root Lasso (also called scaled Lasso). For the sake of conciseness, we only present a simplified version of Theorem 1 in [71]. Consider some number $\xi > 0$ and some subset $T \subset \{1,\dots,p\}$. The compatibility constant $\kappa[\xi,T]$ is defined by

$$\kappa[\xi,T] = \min_{u\in\mathcal{C}(\xi,T)}\left\{\frac{|T|^{1/2}\|Xu\|_2}{\|u_T\|_1}\right\},\quad\text{where } \mathcal{C}(\xi,T) = \{u : \|u_{T^c}\|_1 < \xi\|u_T\|_1\}.$$

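Proposition 4.2. There exist positive numerical constants $C_1$, $C_2$, and $C_3$ such that the following holds. Let us consider the square-root Lasso with the tuning parameter $\lambda = 2\sqrt{2\log(p)}$. If we assume that

1. $p \ge C_1$,

2. $\|\beta_0\|_0 \le C_2\,\kappa^2[4,\mathrm{supp}(\beta_0)]\,\dfrac{n}{\log(p)}$,

then, with high probability,

$$\|X(\hat\beta^{SR} - \beta_0)\|_2^2 \le \inf_{\beta\neq 0}\left\{\|X(\beta_0 - \beta)\|_2^2 + C_3\frac{\|\beta\|_0\log(p)}{\kappa^2[4,\mathrm{supp}(\beta)]}\sigma^2\right\}.$$
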

This bound is comparable to the general objective (9) stated in Section 2.4. Interestingly, the constant before the bias term $\|X(\beta_0 - \beta)\|_2^2$ equals one. If $\|\beta_0\|_0 = k$, the square-root Lasso achieves the minimax loss $k\log(p)\sigma^2$ as long as $k\log(p)/n$ is small and $\kappa[4,\mathrm{supp}(\beta_0)]$ is away from zero. This last condition ensures that the design $X$ is not too far from orthogonality on the cone $\mathcal{C}(4,\mathrm{supp}(\beta_0))$. State-of-the-art results for the classical Lasso with known variance [17, 48, 74] all involve this condition.

We next state a risk bound for the Lasso-LinSelect procedure. For $J \subset \{1,\dots,p\}$, we define $\phi_J$ as the largest eigenvalue of $X_J^T X_J$. The following proposition involves the restricted eigenvalue $\phi_* = \max\{\phi_J : \mathrm{Card}(J) \le n/(3\log p)\}$.
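Proposition 4.3. There exist positive numerical constants $C$, $C_1$, $C_2$, and $C_3$ such that the following holds. Take $\Lambda = \mathbb{R}^+$ and assume that

$$\|\beta_0\|_0 \le C\,\frac{\kappa^2[5,\mathrm{supp}(\beta_0)]}{\phi_*}\times\frac{n}{\log(p)}.$$

Then, with probability at least $1 - C_1 p^{-C_2}$, the Lasso estimator $\hat\beta^L_{\hat\lambda}$ selected according to the LinSelect procedure described in Section 4.1 fulfills

$$\|X(\beta_0 - \hat\beta^L_{\hat\lambda})\|_2^2 \le C_3\inf_{\beta\neq 0}\left\{\|X(\beta_0 - \beta)\|_2^2 + \phi_*\frac{\|\beta\|_0\log(p)}{\kappa^2[5,\mathrm{supp}(\beta)]}\sigma^2\right\}. \qquad (16)$$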

The bound (16) is similar to the bound stated above for the square-root Lasso, the most notable differences being the constant larger than 1 in front of the bias term and the quantity $\phi_*$ in front of the variance term. We refer to Appendix E for a proof of Proposition 4.3.

4.4 Support estimation and inverse problem 

Until now, we only discussed estimation methods that perform well in prediction. Little is known when the objective is to infer $\beta_0$ or its support under unknown variance.

Inverse problem. The square-root Lasso [71, 16] is proved to achieve near-optimal risk bounds for the inverse problem under suitable assumptions on the design $X$.

Support estimation. To our knowledge, there are no non-asymptotic results on support estimation for the aforementioned procedures in the unknown variance setting. Nevertheless, some related results and heuristics have been developed for the cross-validation scheme. If the tuning parameter $\lambda$ is chosen to minimize the prediction error (that is, take $\lambda = \lambda^*$ as defined in (4)), the Lasso is not consistent for support estimation (see [51, 56] for results in a random design setting). One idea to overcome this problem is to choose the parameter $\lambda$ that minimizes the risk of the so-called Gauss-Lasso estimator $\hat\beta^{GL}_\lambda$, which is the least-squares estimator over the support of the Lasso estimator $\hat\beta^L_\lambda$:

$$\hat\beta^{GL}_\lambda := \arg\min_{\beta\in\mathbb{R}^p:\ \mathrm{supp}(\beta)\subset\mathrm{supp}(\hat\beta^L_\lambda)} \|Y - X\beta\|_2^2. \qquad (17)$$

When the objective is support estimation, some numerical simulations [62] suggest that it may be more advisable not to apply the selection schemes based on prediction risk (such as $V$-fold CV or LinSelect) to the Lasso estimators but rather to the Gauss-Lasso estimators. Similar remarks also apply to the Dantzig Selector [22].
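
The Gauss-Lasso refit amounts to an ordinary least-squares fit restricted to the columns selected by the Lasso. A minimal sketch follows, assuming the selected columns are linearly independent; it solves the normal equations by Gaussian elimination, and the function names are hypothetical helpers, not part of any package cited here.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [list(A[i]) + [b[i]] for i in range(m)]  # augmented matrix
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular Gram matrix: selected columns are collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * m
    for i in range(m - 1, -1, -1):  # back substitution
        s = M[i][m] - sum(M[i][c] * x[c] for c in range(i + 1, m))
        x[i] = s / M[i][i]
    return x

def gauss_lasso(X, y, beta_lasso):
    """Least-squares refit of y on the columns of X indexed by supp(beta_lasso)."""
    n, p = len(X), len(X[0])
    support = [j for j in range(p) if beta_lasso[j] != 0.0]
    beta = [0.0] * p
    if not support:
        return beta
    gram = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in support]
            for a in support]
    rhs = [sum(X[i][a] * y[i] for i in range(n)) for a in support]
    for j, coef in zip(support, solve_linear_system(gram, rhs)):
        beta[j] = coef
    return beta
```

The refit keeps the support chosen by the Lasso but removes the shrinkage of the non-zero coefficients, which is why its prediction risk can favour sparser supports than the risk of the Lasso itself.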

4.5 Numerical Experiments 

We present two numerical experiments to illustrate the behavior of some of the above-mentioned procedures for high-dimensional sparse linear regression. The first one concerns the problem of tuning the parameter $\lambda$ of the Lasso algorithm for estimating $X\beta_0$. The procedures will be compared on the basis of the prediction risk. The second one concerns the problem of support estimation with Lasso-type estimators. We will focus on the false discovery rate (FDR) and the proportion of true discoveries (Power).

Simulation design. The simulation design is the same as the one described in Sections 6.1 and 8.2 of [13], except that we restrict to the case $n = p = 100$. Therefore, 165 examples are simulated. They are inspired by examples found in [72, 85, 84, 42] and cover a large variety of situations. The simulations were carried out with R (www.r-project.org), using the library elasticnet.

Experiment 1: tuning the Lasso for prediction.

In the first experiment, we compare 10-fold CV [32], LinSelect [13] and the square-root Lasso [16, 71] (also called scaled Lasso) for tuning the Lasso. Concerning the square-root Lasso, we set $\lambda = 2\sqrt{2\log(p)}$ (as suggested in [71]) and we compute the estimator using the algorithm described in Sun and Zhang [71].

For each tuning procedure $\ell \in \{$10-fold CV, LinSelect, square-root Lasso$\}$, we focus on the prediction risk $\mathcal{R}[\hat\beta^L_{\hat\lambda_\ell};\beta_0]$ of the selected Lasso estimator $\hat\beta^L_{\hat\lambda_\ell}$. For each simulated example $e = 1,\dots,165$, we estimate on the basis of 400 runs

- the risk of the oracle (4): $\mathcal{R}_e = \mathcal{R}[\hat\beta^L_{\lambda^*};\beta_0]$,
- the risk when selecting $\lambda$ with procedure $\ell$: $\mathcal{R}_{\ell,e} = \mathcal{R}[\hat\beta^L_{\hat\lambda_\ell};\beta_0]$.

The comparison between the procedures is based on the comparison of the means, standard deviations and quantiles of the risk ratios $\mathcal{R}_{\ell,e}/\mathcal{R}_e$ computed over all the simulated examples $e = 1,\dots,165$. The results are displayed in Table 1.



                                          quantiles
procedure            mean    std-err   0%      50%     75%     90%     95%
Lasso 10-fold CV     1.13    0.08      1.03    1.11    1.15    1.19    1.24
Lasso LinSelect      1.19    0.48      0.97    1.03    1.06    1.19    2.52
Square-root Lasso    5.15    6.74      1.32    2.61    3.37    11.2

Table 1

For each procedure $\ell$, mean, standard error and quantiles of the ratios $\{\mathcal{R}_{\ell,e}/\mathcal{R}_e,\ e = 1,\dots,165\}$.



For 10-fold CV and LinSelect, the risk ratios are close to one. For 90% of the examples, the risk of the Lasso-LinSelect is smaller than the risk of the Lasso-CV, but there are a few examples where the risk of the Lasso-LinSelect is significantly larger than the risk of the Lasso-CV. For the square-root Lasso procedure, the risk ratios are clearly larger than for the two other procedures. An inspection of the results reveals that the square-root Lasso selects estimators with supports of small size. This feature can be interpreted as follows. Due to the bias of the Lasso estimator, the residual variance tends to overestimate the variance, leading the square-root Lasso to select a Lasso estimator $\hat\beta^L_\lambda$ with large $\lambda$. Consequently, the risk is high.

Experiment 2: variable selection with Gauss-Lasso and square-root Lasso.

We now consider the problem of support estimation, sometimes referred to as the problem of variable selection. We implement three procedures: the Gauss-Lasso procedure tuned by either 10-fold CV or LinSelect, and the square-root Lasso. The support of $\beta_0$ is estimated by the support of the selected estimator.
For each simulated example, the FDR and the Power are estimated on the basis of 400 runs. The results are given in Table 2.

False Discovery Rate
                                               quantiles
procedure                mean    std-err   0%      25%     50%     75%     90%
Gauss-Lasso 10-fold CV   0.28    0.26      0       0.08    0.22    0.35    0.74
Gauss-Lasso LinSelect    0.12    0.25      0       0.002   0.02    0.13    0.33
Square-root Lasso        0.13    0.26      0       0.009   0.023   0.07    0.32

Power
                                               quantiles
procedure                mean    std-err   0%      25%     50%     75%     90%
Gauss-Lasso 10-fold CV   0.67    0.18      0.4     0.52    0.65    0.71    1
Gauss-Lasso LinSelect    0.56    0.33      0.002   0.23    0.56    0.93    1
Square-root Lasso        0.59    0.28      0.013   0.41    0.57    0.80    1

Table 2

For each procedure $\ell$, mean, standard error and quantiles of FDR and Power values.

It appears that the Gauss-Lasso CV procedure gives greater values of the FDR 
than the two others. The Gauss-Lasso LinSelect and the square-root Lasso behave 
similarly for the FDR, but the values of the power are more variable for the 
LinSelect procedure. 
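
For reference, the two criteria of this experiment can be estimated from repeated runs as follows. This is a minimal sketch with hypothetical function names; it adopts the usual convention that the false discovery proportion of an empty selection is zero, which is an assumption since the convention is not spelled out here.

```python
def fdp_and_power(true_support, estimated_support):
    """False discovery proportion and proportion of true discoveries for one run."""
    true_support = set(true_support)
    estimated_support = set(estimated_support)
    if not true_support:
        raise ValueError("the power is undefined when the true support is empty")
    true_pos = len(true_support & estimated_support)
    false_pos = len(estimated_support - true_support)
    # Convention (assumed): an empty selection makes no false discovery.
    fdp = false_pos / len(estimated_support) if estimated_support else 0.0
    power = true_pos / len(true_support)
    return fdp, power

def estimate_fdr_power(true_support, supports_over_runs):
    """Average the per-run proportions over the simulated runs."""
    pairs = [fdp_and_power(true_support, s) for s in supports_over_runs]
    fdr = sum(f for f, _ in pairs) / len(pairs)
    power = sum(w for _, w in pairs) / len(pairs)
    return fdr, power
```

With 400 runs per simulated example, `supports_over_runs` would hold the 400 estimated supports of one procedure, and the pair returned is one point in the distributions summarized in Table 2.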

Computation time. 

Let us conclude this numerical section with the comparison of the computation times between the methods. For all methods, the computation time depends on the maximum number of steps in the Lasso algorithm. For the LinSelect method, it also depends on the cardinality of $\mathbb{S}$ or, equivalently, on the maximum number of non-zero components of $\hat\beta$. The results are shown in Table 3. The square-root Lasso is the least time-consuming method, closely followed by the Lasso LinSelect method. The $V$-fold CV, carried out with the function cv.enet of the R package elasticnet, pays the price of several calls to the Lasso algorithm.



n      p      max. steps   k_max   Lasso 10-fold CV   Lasso LinSelect   Square-root Lasso
100    100    100          21      4 s                0.21 s            0.18 s
100    500    100          16      4.8 s              0.43 s            0.4 s
500    500    500          80      300 s              11 s              6.3 s

Table 3

For each procedure, computation time for different values of $n$ and $p$. The maximum number of steps in the Lasso algorithm is taken as max. steps $= \min\{n,p\}$. For the LinSelect procedure, the maximum number of non-zero components of $\hat\beta$, denoted $k_{\max}$, is taken as $k_{\max} = \min\{p, n/\log(p)\}$.

5. GROUP-SPARSITY 

In the previous section, we have made no prior assumptions on the form of $\beta_0$. In some applications, there are some known structures between the covariates. As an example, we treat the now classical case of group sparsity. The covariates are assumed to be clustered into $M$ groups, and when the coefficient $\beta_{0,i}$ corresponding to the covariate $X_i$ is non-zero, then it is likely that all the coefficients $\beta_{0,j}$ with variables $X_j$ in the same group as $X_i$ are non-zero. We refer to the introduction of [8] for practical examples of this so-called group-sparsity assumption.
Let $G_1,\dots,G_M$ form a given partition of $\{1,\dots,p\}$. For $\lambda = (\lambda_1,\dots,\lambda_M)$, the group-Lasso estimator $\hat\beta_\lambda$ is defined as the minimizer of the convex optimization criterion

$$\|Y - X\beta\|_2^2 + \sum_{k=1}^{M}\lambda_k\|\beta^{G_k}\|_2, \qquad (18)$$

where $\beta^{G_k} = (\beta_j)_{j\in G_k}$. The criterion (18) promotes solutions where all the coordinates of $\beta^{G_k}$ are either zero or non-zero, leading to group selection [80]. Under some assumptions on $X$, Huang and Zhang [43] or Lounici et al. [54] provide a suitable choice of $\lambda = (\lambda_1,\dots,\lambda_M)$ that leads to near-optimal prediction bounds. As expected, this choice of $\lambda = (\lambda_1,\dots,\lambda_M)$ is proportional to $\sigma$.
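
The group-selection effect is easiest to see in the special case of an orthonormal design, $X^T X = I_p$, which is an assumption made only for this sketch. The criterion $\|Y - X\beta\|_2^2 + \sum_k \lambda_k\|\beta^{G_k}\|_2$ then separates across groups, and each block of $z = X^T Y$ is shrunk by block soft-thresholding with threshold $\lambda_k/2$ (the factor $1/2$ comes from the absence of a $1/2$ in front of the squared loss). The function names are hypothetical.

```python
import math

def group_soft_threshold(z_block, lam):
    """Minimizer of ||z_block - b||_2^2 + lam * ||b||_2 over b."""
    norm = math.sqrt(sum(v * v for v in z_block))
    if norm <= lam / 2.0:
        return [0.0] * len(z_block)  # the whole group is set to zero
    shrink = 1.0 - lam / (2.0 * norm)
    return [shrink * v for v in z_block]

def group_lasso_orthonormal(z, groups, lams):
    """Group-Lasso solution when X^T X = I_p, with z = X^T Y.
    groups: list of index lists forming a partition of range(len(z)).
    lams: one penalty level per group."""
    beta = [0.0] * len(z)
    for idx, lam in zip(groups, lams):
        block = group_soft_threshold([z[j] for j in idx], lam)
        for j, v in zip(idx, block):
            beta[j] = v
    return beta
```

Each group is either kept with all its coordinates shrunk by a common factor or removed entirely, and the decision depends on $\|z^{G_k}\|_2$ compared with $\lambda_k/2$, which is why a sensible $\lambda_k$ scales with $\sigma$.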

As for the Lasso, $V$-fold CV is widely used in practice to tune the penalty parameter $\lambda = (\lambda_1,\dots,\lambda_M)$. To our knowledge, there is not yet any extension of the procedures described in Section 4.2 to the group Lasso. An alternative to cross-validation is to use LinSelect.

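Tuning the group-Lasso with LinSelect. For any $\mathcal{K} \subset \{1,\dots,M\}$, we define the submatrix $X_{(\mathcal{K})}$ of $X$ by only keeping the columns of $X$ with index in $\bigcup_{k\in\mathcal{K}} G_k$. We also write $X_{G_k}$ for the submatrix of $X$ built from the columns with index in $G_k$. The collection $\mathbb{S}$ and the function $\Delta$ are given by

$$\mathbb{S} = \Big\{\mathrm{range}(X_{(\mathcal{K})}) : 1 \le |\mathcal{K}| \le n/(3\log(M)) \ \text{and}\ \sum_{k\in\mathcal{K}} |G_k| \le n/2 - 1\Big\}$$

and $\Delta(\mathrm{range}(X_{(\mathcal{K})})) = \log\big[|\mathcal{K}|\binom{M}{|\mathcal{K}|}\big]$. For a given $\Lambda \subset \mathbb{R}_+^M$, similarly to Section 4.1, we define $\hat{\mathcal{K}}_\lambda = \{k : \|\hat\beta_\lambda^{G_k}\|_2 \neq 0\}$ and

$$\hat{\mathbb{S}} = \big\{\mathrm{range}(X_{(\hat{\mathcal{K}}_\lambda)}),\ \lambda\in\hat\Lambda\big\},\quad\text{with } \hat\Lambda = \big\{\lambda\in\Lambda,\ \mathrm{range}(X_{(\hat{\mathcal{K}}_\lambda)})\in\mathbb{S}\big\}.$$

Proposition C.1 in Appendix C ensures that we have for some constant $C > 1$

$$\mathcal{R}[\hat\beta_{\hat\lambda};\beta_0] \le C\,\mathbb{E}\Big[\inf_{\lambda\in\hat\Lambda}\big\{\|X\hat\beta_\lambda - X\beta_0\|_2^2 + \big(\|\hat\beta_\lambda\|_0 \vee |\hat{\mathcal{K}}_\lambda|\log(M)\big)\sigma^2\big\}\Big].$$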

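In the following, we provide a more explicit bound. For simplicity, we restrict to the specific case where each group has the same cardinality $T$. For $\mathcal{K} \subset \{1,\dots,M\}$, we define $\phi_{(\mathcal{K})}$ as the largest eigenvalue of $X_{(\mathcal{K})}^T X_{(\mathcal{K})}$ and we set

$$\phi_* = \max\Big\{\phi_{(\mathcal{K})} : 1 \le |\mathcal{K}| \le \frac{n-2}{2T \vee 3\log(M)}\Big\}.$$

We assume that all the columns of $X$ are normalized to 1 and, following Lounici et al. [54], we introduce for $1 \le s \le M$

$$\kappa_G[\xi,s] = \min_{1\le|\mathcal{K}|\le s}\ \min_{u\in\Gamma(\xi,\mathcal{K})}\frac{\|Xu\|_2}{\|u_{(\mathcal{K})}\|_2}, \qquad (20)$$

where $\Gamma(\xi,\mathcal{K})$ is the cone of vectors $u \in \mathbb{R}^p\setminus\{0\}$ such that $\sum_{k\in\mathcal{K}^c}\lambda_k\|u^{G_k}\|_2 \le \xi\sum_{k\in\mathcal{K}}\lambda_k\|u^{G_k}\|_2$. In the sequel, $\mathcal{K}_0$ stands for the set of groups containing non-zero components of $\beta_0$.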
Proposition 5.1. There exist positive numerical constants $C$, $C_1$, $C_2$, and $C_3$ such that the following holds. Assume that $\Lambda$ contains $\bigcup_{\lambda\in\mathbb{R}^+}\{(\lambda,\dots,\lambda)\}$, that $T \le (n-2)/4$ and that

$$1 \le |\mathcal{K}_0| \le C\,\frac{\kappa_G^2[3,|\mathcal{K}_0|]}{\phi_*}\times\frac{n-2}{\log(M)\vee T}.$$

Then, with probability larger than $1 - C_1 M^{-C_2}$, we have

$$\|X\hat\beta_{\hat\lambda} - X\beta_0\|_2^2 \le C_3\,\frac{\phi_*}{\kappa_G^2[3,|\mathcal{K}_0|]}\,|\mathcal{K}_0|\,\big(T\vee\log(M)\big)\,\sigma^2.$$

This proposition provides a bound comparable to the bounds of Lounici et al. [54], without requiring the knowledge of the variance. Its proof can be found in Appendix E.

6. VARIATION-SPARSITY 

We focus in this section on variation-sparse regression. We recall that the vector $\beta^V \in \mathbb{R}^{p-1}$ of the variations of $\beta$ has for coordinates $\beta^V_j = \beta_{j+1} - \beta_j$ and that the variation-sparse setting corresponds to the setting where the vector of variations $\beta_0^V$ is coordinate-sparse. In the following, we restrict to the case where $n = p$ and $X$ is the identity matrix. In this case, the problem of variation-sparse regression coincides with the problem of segmentation of the mean of the vector $Y = \beta_0 + \varepsilon$.

For any subset $I \subset \{1,\dots,n-1\}$, we define $S_I = \{\beta\in\mathbb{R}^n : \mathrm{supp}(\beta^V) \subset I\}$ and $\hat\beta_I = \Pi_{S_I} Y$. For any integer $q \in \{0,\dots,n-1\}$, we also define the "best" subset of size $q$ by

$$\hat I_q = \arg\min_{|I| = q} \|Y - \hat\beta_I\|_2^2.$$

Though the number of subsets $I \subset \{1,\dots,n-1\}$ of cardinality $q$ is of order $n^{q}$, this minimization can be performed using dynamic programming with a complexity of order $n^2$ [39]. To select $\hat I = \hat I_{\hat q}$ with $\hat q$ in $\{0,\dots,n-1\}$, any of the generic selection schemes of Section 3 can be applied. Below, we instantiate these schemes and present some alternatives.
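
The dynamic-programming step can be sketched as follows for the segmentation problem $X = I_n$. Projecting $Y$ onto the piecewise-constant vectors with jumps in $I$ replaces each segment by its mean, so the residual sum of squares is a sum of within-segment costs, and the best set of $q$ jumps satisfies a Bellman recursion on the position of the last jump. The function below is a hypothetical illustration, not the implementation of [39]; it returns the optimum for every $q \le q_{\max}$ in $O(q_{\max} n^2)$ operations. A jump reported at position $s$ separates $Y_s$ from $Y_{s+1}$ in the 1-based indexing of the text.

```python
def best_segmentations(y, q_max):
    """For q = 0..q_max, return {q: (rss, jumps)} where jumps is the best set
    of q change-point positions and rss the residual sum of squares.
    Requires q_max <= len(y) - 1."""
    n = len(y)
    # Prefix sums give the cost of any segment y[a:b] in O(1).
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def seg_cost(a, b):
        total = s1[b] - s1[a]
        return max(s2[b] - s2[a] - total * total / (b - a), 0.0)

    inf = float("inf")
    # cost[d][t]: best rss for y[0:t] with exactly d jumps; back[d][t]: last jump.
    cost = [[inf] * (n + 1) for _ in range(q_max + 1)]
    back = [[0] * (n + 1) for _ in range(q_max + 1)]
    for t in range(1, n + 1):
        cost[0][t] = seg_cost(0, t)
    for d in range(1, q_max + 1):
        for t in range(d + 1, n + 1):
            for s in range(d, t):
                c = cost[d - 1][s] + seg_cost(s, t)
                if c < cost[d][t]:
                    cost[d][t], back[d][t] = c, s
    result = {}
    for q in range(q_max + 1):
        jumps, t = [], n
        for d in range(q, 0, -1):  # walk the back-pointers
            t = back[d][t]
            jumps.append(t)
        result[q] = (cost[q][n], sorted(jumps))
    return result
```

The residual sums of squares returned for each $q$ are exactly the quantities $\|Y - \hat\beta_{\hat I_q}\|_2^2$ that the selection schemes then compare.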

6.1 Penalized empirical loss 

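When the variance $\sigma^2$ is known, penalized log-likelihood model selection amounts to selecting a subset $I$ which minimizes a criterion of the form $\|Y - \hat\beta_I\|_2^2 + \mathrm{pen}(\mathrm{Card}(I))$. This is equivalent to selecting $\hat I = \hat I_{\hat q}$ with $\hat q$ minimizing

$$\mathrm{Crit}(q) = \|Y - \hat\beta_{\hat I_q}\|_2^2 + \mathrm{pen}(q). \qquad (21)$$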

Following the work of Birgé and Massart [18], Lebarbier [50] considers the penalty

$$\mathrm{pen}(q) = (q+1)\big(c_1\log(n/(q+1)) + c_2\big)\sigma^2$$

and determines the constants $c_1 = 2$, $c_2 = 5$ by extensive numerical experiments (see also Comte and Rozenholc [25] for a similar approach in a more general setting). With this choice of the penalty, the procedure satisfies a bound of the form



$$\mathcal{R}[\hat\beta_{\hat I};\beta_0] \le C\inf_{I\subset\{1,\dots,n-1\}}\Big\{\|\Pi_{S_I}\beta_0 - \beta_0\|_2^2 + (1+|I|)\log\big(n/(1+|I|)\big)\sigma^2\Big\}. \qquad (22)$$

When $\sigma^2$ is unknown, several approaches have been proposed.

Plug-in estimator. The idea is to replace $\sigma^2$ in $\mathrm{pen}(q)$ by an estimator of the variance such as $\hat\sigma^2 = \sum_{i=1}^{n/2}(Y_{2i} - Y_{2i-1})^2/n$, or one of the estimators proposed by Hall et al. [40]. No theoretical results are proved in a non-asymptotic framework.
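
The plug-in approach can be sketched in a few lines, assuming the difference-based estimator $\hat\sigma^2 = \sum_{i=1}^{n/2}(Y_{2i}-Y_{2i-1})^2/n$ and the penalty $\mathrm{pen}(q) = (q+1)(c_1\log(n/(q+1)) + c_2)\sigma^2$ with $c_1 = 2$, $c_2 = 5$. The function names are hypothetical, and `rss_by_q` is assumed to map each $q$ to the residual sum of squares of the best segmentation with $q$ jumps.

```python
import math

def difference_variance(y):
    """Difference-based variance estimate from non-overlapping pairs.
    Each squared difference has mean 2 * sigma^2 when the mean is locally
    constant, hence the division by 2 * (number of pairs), which is n for even n."""
    pairs = len(y) // 2
    total = sum((y[2 * i + 1] - y[2 * i]) ** 2 for i in range(pairs))
    return total / (2 * pairs)

def lebarbier_penalty(q, n, sigma2, c1=2.0, c2=5.0):
    """pen(q) = (q + 1) * (c1 * log(n / (q + 1)) + c2) * sigma^2."""
    return (q + 1) * (c1 * math.log(n / (q + 1)) + c2) * sigma2

def select_q(rss_by_q, n, sigma2):
    """Return the q minimizing rss(q) + pen(q)."""
    return min(rss_by_q, key=lambda q: rss_by_q[q] + lebarbier_penalty(q, n, sigma2))
```

A typical use is `select_q(rss_by_q, n, difference_variance(y))`. The estimate is biased upward by jumps that fall inside a pair, which is harmless when jumps are rare but is one reason why no non-asymptotic guarantee is claimed for this approach.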



Estimating the variance by the residual least-squares. Baraud et al. [12], Section 5.4.2, propose to select $q$ by minimizing a penalized log-likelihood criterion. This criterion can be written in the form $\mathrm{Crit}(q) = \|Y - \hat\beta_{\hat I_q}\|_2^2\,(1 + K\,\mathrm{pen}(q))$, with $K > 1$ and the penalty $\mathrm{pen}(q)$ solving

$$\mathbb{E}\big[(U - \mathrm{pen}(q)V)_+\big] = \frac{1}{(q+1)\binom{n-1}{q}},$$

where $(\cdot)_+ = \max(\cdot,0)$, and $U$, $V$ are two independent $\chi^2$ variables with respectively $q+2$ and $n-q-2$ degrees of freedom. The resulting estimator $\hat\beta_{\hat I}$ with $\hat I = \hat I_{\hat q}$ satisfies a non-asymptotic risk bound similar to (22) for all $K > 1$. The choice $K = 1.1$ is suggested in practice.

Slope heuristic. Lebarbier [50] implements the slope heuristic introduced by Birgé and Massart [19] for handling the unknown variance $\sigma^2$. The method consists in calibrating the penalty directly, without estimating $\sigma^2$. It is based on the following principle. First, there exists a so-called minimal penalty $\mathrm{pen}_{\min}(q)$ such that choosing $\mathrm{pen}(q) = K\,\mathrm{pen}_{\min}(q)$ in (21) with $K < 1$ can lead to a strong overfit, whereas for $K > 1$ the bound (22) is met. Second, it can be shown that there exists a dimension jump around the minimal penalty, which allows one to estimate $\mathrm{pen}_{\min}(q)$ from the data. The slope heuristic then proposes to select $q$ by minimizing the criterion $\mathrm{Crit}(q) = \|Y - \hat\beta_{\hat I_q}\|_2^2 + 2\,\mathrm{pen}_{\min}(q)$. Arlot and Massart [7] provide a non-asymptotic risk bound for this procedure. Their results are proved in a general regression model with heteroscedastic and non-Gaussian errors, but with a constraint on the number of models per dimension which is not met for the family of models $(S_I)_{I\subset\{1,\dots,n-1\}}$. Nevertheless, the authors indicate how to generalize their results for the problem of signal segmentation.

Finally, for practical issues, different procedures for estimating the minimal penalty are compared and implemented in Baudry et al. [15].

6.2 CV procedure 

In a recent paper, Arlot and Celisse [5] consider the problem of signal segmentation using cross-validation. Their results apply in the heteroscedastic case. They consider several CV methods (leave-one-out, leave-$p$-out and $V$-fold CV) for estimating the quadratic loss. They propose two cross-validation schemes. The first one, denoted Procedure 5, aims to estimate directly $\mathbb{E}[\|\beta_0 - \hat\beta_{\hat I}\|_2^2]$, while the second one, denoted Procedure 6, relies on two steps where the cross-validation is used first for choosing the best partition of dimension $q$, then the best dimension $q$. They show that the leave-$p$-out CV method can be implemented with a complexity of order $n^2$, and they give a control of the expected CV risk. The use of CV leads to some restrictions on the subsets $I$ that compete for estimating $\beta_0$. This problem is discussed in [5], Section 3 of the supplemental material.

6.3 Alternative for very high-dimensional settings 

When $n$ is very large, the dynamic programming optimization can become computationally too intensive. An attractive alternative is based on the fused Lasso proposed by Tibshirani et al. [73]. The estimator $\hat\beta^{FL}_\lambda$ is defined by minimizing the convex criterion

$$\|Y - \beta\|_2^2 + \lambda\sum_{j=1}^{n-1}|\beta_{j+1} - \beta_j|,$$

where the total-variation norm $\sum_j|\beta_{j+1} - \beta_j|$ promotes solutions which are variation-sparse. The family $(\hat\beta^{FL}_\lambda)_{\lambda>0}$ can be computed very efficiently with the LARS algorithm, see Vert and Bleakley [75]. A sensible choice of the parameter $\lambda$ must be proportional to $\sigma$. When the variance $\sigma^2$ is unknown, the parameter $\lambda$ can be selected either by $V$-fold CV or by LinSelect (see Section 5.1 in [13] for details).

7. EXTENSIONS 

7.1 Gaussian design and graphical models 

Assume that the design $X$ is now random and that the $n$ rows $X^{(i)}$ are independent observations of a Gaussian vector with mean $0_p$ and unknown covariance matrix $\Sigma$. This setting is mainly motivated by applications in compressed sensing [28] and in Gaussian graphical modeling. Indeed, Meinshausen and Bühlmann [56] have proved that it is possible to estimate the graph of a Gaussian graphical model by studying linear regression with Gaussian design and unknown variance. If we work conditionally on the observed design $X$, then all the results and methodologies described in this survey still apply. Nevertheless, these prediction results do not really take into account the fact that the design is random. In this setting, it is more natural to consider the integrated prediction risk $\mathbb{E}[\|\Sigma^{1/2}(\hat\beta - \beta_0)\|_2^2]$ rather than the risk (3). Some procedures [34, 76] have been proved to achieve optimal risk bounds with respect to this risk, but they are computationally intractable in a high-dimensional setting. In the context of Gaussian graphical modeling, the procedure GGMselect [38] is designed to select among any collection of graph estimators and it is proved to achieve near-optimal risk bounds in terms of the integrated prediction risk.

7.2 Non Gaussian noise 

A few results do not require that the noise $\varepsilon$ follows a Gaussian distribution. The Lasso-type procedures such as the square-root Lasso [71, 16] do not require the normality of the noise and extend to other distributions. In practice, it seems that cross-validation procedures still work well for other distributions of the noise.

7.3 Multivariate regression 

Multivariate regression deals with $T$ simultaneous linear regression models $y_k = X\beta_k + \varepsilon_k$, $k = 1,\dots,T$. Stacking the $y_k$'s in an $n \times T$ matrix $Y$, we obtain the model $Y = XB_0 + E$, where $B_0$ is a $p \times T$ matrix with columns given by $\beta_k$ and $E$ is an $n \times T$ matrix with i.i.d. entries. The classical structural assumptions on $B_0$ are either that most rows of $B_0$ are identically zero, or that the rank of $B_0$ is small. The first case is a simple case of group sparsity and can be handled by the group-Lasso as in Section 5. The second case, first considered by Anderson [2] and Izenman [45], is much more non-linear. Writing $\|\cdot\|_F$ for the Frobenius (or Hilbert-Schmidt) norm, the problem of selecting among the estimators

$$\hat B_r = \arg\min_{B:\ \mathrm{rank}(B)\le r}\|Y - XB\|_F^2,\qquad r\in\{1,\dots,\min(T,\mathrm{rank}(X))\},$$

has been investigated recently from a non-asymptotic point of view by Bunea et al. [20] and Giraud [36]. The prediction risk of $\hat B_r$ is of order



$$\mathbb{E}\big[\|X\hat B_r - XB_0\|_F^2\big] \asymp \sum_{k>r} s_k^2(XB_0) + r\,(T + \mathrm{rank}(X))\,\sigma^2,$$



where $s_k(M)$ denotes the $k$-th largest singular value of the matrix $M$. Therefore, a sensible choice of $r$ depends on $\sigma^2$. The first selection criterion introduced by Bunea et al. [20] requires the knowledge of the variance $\sigma^2$. To handle the case of unknown variance, Bunea et al. [20] propose to plug an estimate of the variance in their selection criterion (which works when $\mathrm{rank}(X) < n$), whereas Giraud [36] introduces a penalized log-likelihood criterion independent of the variance. Both papers provide oracle risk bounds for the resulting estimators showing rate-minimax adaptation.

Several recent papers [9, 58, 63, 20, 48] have investigated another strategy for the low-rank setting. For a positive $\lambda$, the matrix $B_0$ is estimated by

$$\hat B_\lambda \in \arg\min_{B}\Big\{\|Y - XB\|_F^2 + \lambda\sum_k s_k(B)\Big\}.$$

Translating the work on trace regression of Koltchinskii et al. [48] into the setting of multivariate regression provides (under some conditions on $X$) an oracle bound on the risk of $\hat B_{\lambda^*}$ with $\lambda^* = 3 s_1(X)(\sqrt{T} + \sqrt{\mathrm{rank}(X)})\sigma$. We also refer to Giraud [37] for a slight variation of this result requiring no condition on the design $X$. Again, the value of $\lambda^*$ is proportional to $\sigma$. To handle the case of unknown variance, Klopp [47] adapts the concept of the square-root Lasso [16] to this setting and provides an oracle risk bound for the resulting procedure.

7.4 Nonparametric regression 

In the nonparametric regression model (2), classical estimation procedures include local-polynomial estimators, kernel estimators, basis-projection estimators, $k$-nearest neighbors, etc. All these procedures depend on one (or several) tuning parameter(s), whose optimal value(s) scale with the variance $\sigma^2$. $V$-fold CV is widely used in practice for choosing these parameters, but little is known about its theoretical performance.

The class of linear estimators (including spline smoothing, Nadaraya estimators, $k$-nearest neighbors, low-pass filters, kernel ridge regression, etc.) has attracted some attention in recent years. Some papers have investigated the tuning of specific families of estimators. For example, Cao and Golubev [23] provide a tuning procedure for spline smoothing and Zhang [82] analyses kernel ridge regression in depth. Recently, two papers have focused on the tuning of arbitrary linear estimators when the variance $\sigma^2$ is unknown. Arlot and Bach [4] generalize the slope heuristic to symmetric linear estimators with spectrum in $[0,1]$ and prove an oracle bound for the resulting estimator. Baraud et al. [13], Section 4, show that LinSelect can be used for selecting among an (almost) completely arbitrary collection of linear estimators (possibly non-symmetric and/or singular). Corollary 2 in [13] provides an oracle bound for the selected estimator under the mild assumption that some effective dimension of the linear estimators is not larger than a fraction of $n$.
APPENDIX A: A NOTE ON BIC TYPE CRITERIA

The BIC criterion was initially introduced [64] to select an estimator among a collection of constrained maximum likelihood estimators. Nevertheless, modified versions of this criterion are often used for tuning more general estimation procedures. The purpose of this appendix is to illustrate why we advise against this approach in a high-dimensional setting.

Definition A.1. A modified BIC criterion. Suppose we are given a collection $(\hat\beta_\lambda)_{\lambda\in\Lambda}$ of estimators depending on a tuning parameter $\lambda\in\Lambda$. For any $\lambda\in\Lambda$, we consider $\hat\sigma^2_\lambda = \|Y - X\hat\beta_\lambda\|_2^2/n$, and define the modified BIC

$$\hat\lambda \in \arg\min_{\lambda\in\tilde\Lambda}\big\{-2L_n(\hat\beta_\lambda,\hat\sigma_\lambda) + \log(n)\|\hat\beta_\lambda\|_0\big\}, \qquad (A.1)$$

where $L_n$ is the log-likelihood and $\tilde\Lambda = \{\lambda\in\Lambda : \|\hat\beta_\lambda\|_0 \le n/2\}$.

Sometimes, the $\log(n)$ term is replaced by $\log(p)$. Replacing $\Lambda$ by $\tilde\Lambda$ avoids trivial estimators. First, we would like to emphasize that there is no theoretical guarantee that the selected estimator does not overfit in a high-dimensional setting. In practice, using this criterion often leads to overfitting. Let us illustrate this with a simple experiment.

Setting. We consider the model

$$Y_i = \beta_{0,i} + \varepsilon_i,\quad i = 1,\dots,n, \qquad (A.2)$$

with $\varepsilon \sim \mathcal{N}(0,\sigma^2 I_n)$, so that $p = n$ and $X = I_n$. Here, we fix $n = 10000$, $\sigma = 1$ and $\beta_0 = 0_n$.

Methods. We apply the modified BIC criterion to tune the Lasso [72], SCAD [31] and the hard thresholding estimator. The hard thresholding estimator $\hat\beta^{HT}_\lambda$ is defined for any $\lambda > 0$ by $[\hat\beta^{HT}_\lambda]_i = Y_i\mathbf{1}_{|Y_i|>\lambda}$. Given $\lambda > 0$ and $a > 2$, the SCAD estimator $\hat\beta^{SCAD}_\lambda$ is defined as the minimizer of the penalized criterion $\|Y - X\beta\|_2^2 + \sum_{i=1}^{n} p_\lambda(|\beta_i|)$, where for $x > 0$,

$$p'_\lambda(x) = \lambda\mathbf{1}_{x\le\lambda} + (a\lambda - x)_+\mathbf{1}_{x>\lambda}/(a-1).$$

For the sake of simplicity we fix $a = 3$. We write $\hat\beta^{L;BIC}$, $\hat\beta^{HT;BIC}$ and $\hat\beta^{SCAD;BIC}$ for the Lasso, hard thresholding, and SCAD estimators selected by the modified BIC criterion.
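
For the hard thresholding estimator with $X = I_n$, the modified BIC can be computed exactly, because varying $\lambda$ only changes how many of the largest $|Y_i|$ are kept. The sketch below assumes a Gaussian log-likelihood, so that $-2L_n(\hat\beta_\lambda,\hat\sigma_\lambda) = n\log(\hat\sigma^2_\lambda)$ up to an additive constant, and restricts the search to at most $n/2$ non-zero components. The function name is hypothetical.

```python
import math

def modified_bic_hard_threshold(y):
    """Number k of non-zero components selected by the modified BIC for hard
    thresholding with X = I_n: minimize n * log(rss_k / n) + log(n) * k over
    0 <= k <= n / 2, where rss_k is the sum of squares of the discarded y_i."""
    n = len(y)
    squares = sorted((v * v for v in y), reverse=True)
    rss = sum(squares)
    best_k, best_crit = 0, None
    for k in range(n // 2 + 1):
        if k > 0:
            rss -= squares[k - 1]  # keep the k-th largest |y_i|
        sigma2 = rss / n
        if sigma2 <= 0.0:
            break  # log-likelihood is unbounded, stop the search
        crit = n * math.log(sigma2) + math.log(n) * k
        if best_crit is None or crit < best_crit:
            best_k, best_crit = k, crit
    return best_k
```

Feeding this function pure Gaussian noise (for instance `[random.gauss(0.0, 1.0) for _ in range(10000)]`) gives the number of spurious components retained in one replication of the experiment described in this appendix.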

Results. We ran $N = 200$ experiments. For each of these experiments, the estimators $\hat\beta^{L;BIC}$, $\hat\beta^{HT;BIC}$ and $\hat\beta^{SCAD;BIC}$ are computed. The mean number of non-zero components and the estimated risk $\mathcal{R}[\hat\beta^{*;BIC}; 0_n]$ are reported in Table 1.

                                               LASSO        SCAD       Hard Thres.
Estimated risk R[\hat\beta^{*;BIC}; 0_n]       4.6x10^-2    1.6x10^    3.0x10^
Mean of ||\hat\beta^{*;BIC}||_0                0.025        86.9       28.2

Table 1: Estimated risk and estimated number of non-zero components for $\hat\beta^{L;BIC}$, $\hat\beta^{SCAD;BIC}$, and $\hat\beta^{HT;BIC}$.

Obviously, the SCAD and hard thresholding methods select too many irrelevant variables when they are tuned with BIC. Moreover, their risks are quite high. Intuitively, this is due to the fact that the $\log(n)$ (or $\log(p)$) term in the BIC penalty is too small in this high-dimensional setting ($n = p$).
For the Lasso estimator, a very specific phenomenon occurs due to the soft thresholding effect. In the discussion of [30], Loubes and Massart advocate that soft thresholding estimators penalized by Mallows' $C_p$ [55] penalties should yield good results, while hard thresholding estimators penalized by Mallows' $C_p$ are known to highly overfit. This strange behavior is due to the bias of the soft thresholding estimator. Nevertheless, Loubes and Massart's arguments have been developed for an orthogonal design. In fact, there is no non-asymptotic justification that the Lasso tuned by BIC or AIC performs well for general designs $X$.

Conclusion. The use of the modified BIC criterion to tune estimation procedures in a high-dimensional setting is not supported by theoretical results. It is proved to overfit in the case of thresholding estimators [12, Sect. 3.2.2]. Empirically, BIC seems to overfit except for the Lasso. We advise the practitioner to avoid BIC (and AIC) when $p$ is at least of the same order as $n$. For instance, LinSelect is supported by non-asymptotic arguments and by empirical results [13], in contrast to BIC.

APPENDIX B: MINIMAX ADAPTIVE PROCEDURES 

In this section, we detail procedures that are minimax adaptive to the sparsity $k$ simultaneously for all designs $X$ in the sense of (7). In most settings, these procedures are not of practical interest as they are intractable for large $p$. We present them as theoretical benchmarks to assess the quality of fast procedures.

Given a subspace $S$ of $\mathbb{R}^n$, we define $\hat\beta_S$ as a least-squares estimator of $\beta_0$ such that $X\hat\beta_S$ is included in $S$:

$$\hat\beta_S \in \arg\min_{\beta\in\mathbb{R}^p,\ X\beta\in S}\|Y - X\beta\|_2^2.$$

We consider the collections of subspaces:

$$\mathbb{S}_1 = \big\{S = \mathrm{range}(X_J),\ J\subset\{1,\dots,p\},\ J\neq\emptyset,\ 2|J|[1+\log(p/|J|)] \le n\big\}\cup\big\{\mathrm{range}(X_{\{1,\dots,p\}})\big\},$$

$$\mathbb{S}_2 = \big\{S = \mathrm{range}(X_J),\ J\subset\{1,\dots,p\},\ J\neq\emptyset,\ |J| \le (n-1)/4\big\}.$$

Finally, we note $k^* := \max\{k : 2k[1+\log(p/k)] \le n\}$. To simplify the presentation, we assume throughout this section that $n \le p$ and that $\mathrm{Rank}(X) \ge k^*$.

B.l Known variance 

A penalization strategy. The model selection paradigm aims at selecting an estimator with the smallest possible risk. One strategy to tackle the selection problem amounts to minimizing a least-squares criterion penalized by the "complexity" of the collection of models under consideration. We select $\hat S^{BM}$ as one minimizer over $S\in\mathbb{S}_1$ of the following criterion

$$\|Y - \Pi_S Y\|_2^2 + \begin{cases} 4\dim(S)\big[4 + \log\big(\tfrac{p}{\dim(S)}\big)\big]\sigma^2 & \text{if } \dim(S)\le k^*,\\ 2n\sigma^2 & \text{if } \dim(S) = \mathrm{Rank}(X).\end{cases}$$

We write $\hat\beta^{BM} := \hat\beta_{\hat S^{BM}}$. More general forms of penalties are discussed in [18].
An aggregation strategy. In contrast to model selection, model aggregation 
aims at mixing a collection of estimators. Following Leung and Barron [52], we 
mix the least-squares estimators $\hat\beta_S$ in the following way 

$$\hat\beta^{LB} := \sum_{S\in\mathbb{S}_1} \omega_S\,\hat\beta_S ,$$

where the weights $\omega_S$ sum to one and for any $S \in \mathbb{S}_1$, $\omega_S$ is proportional to 



exp 



\Y -UsY\\l+2a^ dim(^) 



1 



if dim(5) < k* 

if dim(5) =Rank(X). 



We refer to [52] for more general forms of the aggregation procedures. 

Risk bounds. In the next proposition, we state that $\hat\beta^{BM}$ and $\hat\beta^{LB}$ are minimax 
adaptive to the sparsity for all designs $X$ in the sense of (7). 

Proposition B.1. There exist numerical constants $C_1$ and $C_2$ such that the 
following holds. For any design X, any k € and any vector (Sq such 

that ||/3o||o = k, we have 



n 
n 



< Ci 

< C2 



k[l + log 
k(l+ los 



A n 
A n 



a 



a 



These two risk bounds derive straightforwardly from the aforementioned work [18, 
52]. 

B.2 Unknown variance 

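For any set $S \in \mathbb{S}_2$, we set the following measure of complexity $\Delta(S)$ 

$$\Delta(S) = \log\binom{p}{\dim(S)} + \log(\dim(S)) ,$$

and we take the same penalty term $\mathrm{pen}(S)$ as for LinSelect (see Appendix C.1). 
Baraud et al. [12] consider the model selection estimator $\hat\beta^{BGH} := \hat\beta_{\hat S^{BGH}}$ with 

$$\hat S^{BGH} := \operatorname*{argmin}_{S\in\mathbb{S}_2} \|Y - \Pi_S Y\|_2^2 \left(1 + \frac{\mathrm{pen}(S)}{n - \dim(S)}\right).$$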
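
To make the selection rule concrete, here is a minimal Python sketch of the complexity measure and of the criterion minimized over the collection of models. It assumes that the complexity reads $\Delta(S) = \log\binom{p}{\dim(S)} + \log(\dim(S))$, and it takes the residual sum of squares and the penalty of each candidate model as given inputs. The names `complexity`, `bgh_criterion` and `select_model` are hypothetical.

```python
import math


def complexity(p: int, dim: int) -> float:
    """Delta(S) = log C(p, dim(S)) + log(dim(S)), for 1 <= dim(S) <= p (assumed form)."""
    return math.log(math.comb(p, dim)) + math.log(dim)


def bgh_criterion(rss: float, n: int, dim: int, pen: float) -> float:
    """||Y - Pi_S Y||^2 * (1 + pen(S) / (n - dim(S)))."""
    return rss * (1.0 + pen / (n - dim))


def select_model(candidates: dict, n: int):
    """Return the label of the candidate model minimizing the criterion.

    `candidates` maps a model label to a tuple (rss, dim, pen); computing the
    residual sum of squares and the penalty is left to the caller.
    """
    return min(candidates, key=lambda s: bgh_criterion(candidates[s][0], n, *candidates[s][1:]))


if __name__ == "__main__":
    models = {"S1": (10.0, 1, 3.0), "S2": (6.0, 3, 9.0)}
    print(select_model(models, n=20))
```

Note that the criterion is multiplicative in the residual sum of squares, so no estimate of the variance is needed to compare models.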

The first risk bound only covers the (non-ultra) high-dimensional setting. 

Proposition B.2. There exists some numerical constant $C$ such that the 
following holds. For any design $X$ and any vector $\beta_0$, we have 



n 



/3 



BGH. 



/3o 



< C 



inf 

ll/3||0 < 2 1og(p) 



|X(/3-/3o)||i + 



1 + log 



P 



Proposition B.2 is a straightforward consequence of Corollary 1 in [12]. It shows 
that simultaneous adaptation to the variance and the sparsity is possible if we 
restrict ourselves to a non-ultra high-dimensional setting. The next proposition 
complements the risk upper bound of Proposition 2.2. Consider /J*-"-* as a least- 
squares estimator of f3o over M". 
Proposition B.3. There exist numerical constants $C$, $C_1$, and $C_2$ such that 
the following holds. For any design $X$, any $\sigma > 0$, and any vector $\beta_0 \in \mathbb{R}^p$, we 
have 



n 



/3^");/3o' 



< Cna\ 



For any design $X$, any $\sigma > 0$, any $k \in \{1, \dots, (n-1)/4\}$ and any vector $\beta_0 \in \mathbb{R}^p$ 
such that $\|\beta_0\|_0 = k$, we have 



n 



P^'^^llSo < CiA;log(|)exp 



C.^log(f 
n \k 



a 



The first bound is straightforward while the second bound derives from [12]. 

APPENDIX C: COMPLEMENTS ON LINSELECT 

C.1 More details on the selection procedure 

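The penalty $\mathrm{pen}(S)$ involved in the LinSelect criterion (11) is defined by 
$\mathrm{pen}(S) = 1.1\,\mathrm{pen}_\Delta(S)$, where $\mathrm{pen}_\Delta(S)$ is the unique solution of 

$$\mathbb{E}\left[\left(U - \frac{\mathrm{pen}_\Delta(S)}{n - \dim(S)}\,V\right)_+\right] = e^{-\Delta(S)},$$

where $U$ and $V$ are two independent chi-square random variables with $\dim(S) + 1$ 
and $n - \dim(S) - 1$ degrees of freedom respectively. It is also the solution in $x$ of 

$$e^{-\Delta(S)} = (D+1)\,\mathbb{P}\left(F_{D+3,N-1} \ge x\,\frac{N-1}{N(D+3)}\right) - x\,\frac{N-1}{N}\,\mathbb{P}\left(F_{D+1,N+1} \ge x\,\frac{N+1}{N(D+1)}\right),$$

where $D = \dim(S)$, $N = n - \dim(S)$ and $F_{d,r}$ is a Fisher random variable with $d$ 
and $r$ degrees of freedom. 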
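
The defining equation of $\mathrm{pen}_\Delta(S)$ has no closed-form solution, but it is easy to solve numerically because the expectation $\mathbb{E}[(U - xV/(n-\dim(S)))_+]$ decreases in $x$. The sketch below is not the LinSelect implementation: it approximates this expectation by a Monte Carlo average over a fixed sample and finds the root by bisection, using only the Python standard library. The function names, the number of draws and the seed are illustrative choices, and the Monte Carlo approximation becomes unreliable when $e^{-\Delta(S)}$ is very small.

```python
import math
import random


def draw_chi2_pairs(n, dim, n_draws=20000, seed=0):
    """Draw independent pairs (U, V) with U ~ chi2(dim + 1), V ~ chi2(n - dim - 1)."""
    rng = random.Random(seed)
    # A chi-square with k degrees of freedom is a Gamma(shape=k/2, scale=2).
    return [
        (rng.gammavariate((dim + 1) / 2.0, 2.0),
         rng.gammavariate((n - dim - 1) / 2.0, 2.0))
        for _ in range(n_draws)
    ]


def mean_positive_part(x, n, dim, pairs):
    """Monte Carlo estimate of E[(U - x * V / (n - dim))_+]."""
    c = x / (n - dim)
    return sum(max(u - c * v, 0.0) for u, v in pairs) / len(pairs)


def pen_delta(n, dim, delta, pairs, n_iter=50):
    """Approximate root x of E[(U - x V / (n - dim))_+] = exp(-delta), for delta >= 0."""
    target = math.exp(-delta)
    lo, hi = 0.0, 1.0
    while mean_positive_part(hi, n, dim, pairs) > target:  # bracket the root
        hi *= 2.0
    for _ in range(n_iter):  # bisection: the estimate is non-increasing in x
        mid = 0.5 * (lo + hi)
        if mean_positive_part(mid, n, dim, pairs) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    pairs = draw_chi2_pairs(n=50, dim=3)
    print(1.1 * pen_delta(50, 3, delta=2.0, pairs=pairs))  # pen(S) = 1.1 * pen_Delta(S)
```

Reusing the same sample for every value of $x$ keeps the estimated function monotone, so the bisection is well defined; a deterministic alternative would evaluate the two Fisher tail probabilities of the second characterization.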

Proposition 4 in [12] ensures the following upper bound on $\mathrm{pen}_\Delta(S)$. For any 
$0 < \kappa < 1$, there exists a constant $C_\kappa > 1$ such that for any $S \in \mathbb{S}$ fulfilling 
$1 \le \dim(S) \vee \Delta(S) \le \kappa n$ we have 

$$\mathrm{pen}_\Delta(S) \le C_\kappa\,(\dim(S) \vee \Delta(S)).$$

Conversely, Lemma D.3 in Appendix D ensures that $\mathrm{pen}_\Delta(S) \ge 2\Delta(S) + \dim(S) - C$ 
for some constant $C > 0$. 

C.2 A general risk bound for LinSelect 

We set 

$$\Sigma = \sigma^2 \sum_{S\in\mathbb{S}} e^{-\Delta(S)}. \qquad \text{(C.1)}$$

The following proposition gives a risk bound when selecting $\hat\lambda$ by minimizing (11). 

Proposition C.1. Assume that $1 \le \dim(S) \le n/2 - 1$ and $\Delta(S) \le 2n/3$ for 
all $S \in \mathbb{S}$. Then, there exists a constant $C > 1$ such that for any minimizer $\hat\lambda$ of 
the Criterion (11), we have 



< E 



inf <; IIX^A - X/3o||l + inf jllX^A - I^-S^AWl + [A(5) V dim(S)]f72| 

AeA I 56S 



Furthermore, with probability larger than 1 — e ^'^^ — CiY^g^^e C'2[A(S')An]g A(S)^ 
we have for some C > 1 



X/3o - XA 



AeA 



< inf ||X/3a-X/3o||^ + 



inf jllX^A - n^X^Ai + [A(5) Vdim(5)]a2|| . 



The first part of Proposition C.1 is a slight variation of Theorem 1 in [13]. We 
refer to Appendix D.1 for a sketch of the proof of this result. The second part 
is proved in Appendix D.2. 



APPENDIX D: PROOF OF PROPOSITION C.1 

D.1 Proof of the first part of Proposition C.1 

In this section, $C$ denotes a constant whose value may vary from line to line. 
We also use the notation $\|\cdot\|$ for $\|\cdot\|_2$, $f_0 = X\beta_0$ and $\hat f_\lambda = X\hat\beta_\lambda$. 
Finally, for any $S \in \mathbb{S}$, we write $\bar S$ for the linear space generated by $S$ and $f_0$. Let 
$(\hat\lambda, \hat S)$ be any minimizer over $\Lambda \times \mathbb{S}$ of 



Crit(A,5) 



Y - Usf> 



+ 



1 



fx - lis fx 



+ pen(5)a|. 



From $\mathrm{Crit}(\hat\lambda, \hat S) \le \mathrm{Crit}(\lambda, S)$ and simple algebra, we get for any $K > 1$, $\lambda \in \Lambda$ 
and $S \in \mathbb{S}$ 



fo-'^sJx 
< 



< 



2 1 

+ ^ 



/o - n^/A 


2 1 

+ 2 


fx - Us fx 


+ 2{e,nsJj^- fo) 


— pen(S'=K)a 


fo - Us fx 


2 1 

+ 2 


7a - Us fx 



2pen(S')CT 



fo-'^lsJx 
fo - nsfx 



+ K\\Us^e 



+ K\mo£\ 



^ + 2pen(5)a| 

2 



pen(S')a|. 



pen(S'*)a5.^ 
pen(5)CT|, 



the second inequality following from $2\langle f, g\rangle \le K^{-1}\|f\|^2 + K\|g\|^2$. Introducing 
the notation 



,2 pen(5) ^ ^^,,2 

' n — dim(S') " ^ 
we can reformulate the above bound as 



2 + 



1 



1 - K-^ 

-1 



-1 



/o - h 



2 1 

+ 2 

2 1 

+ 2' 



fx - li-sfx + 2pen(5)a| + t. 



(D.l) 



For any $S \in \mathbb{S}$ we have $\dim(S) \le n/2 - 1$ and $\Delta(S) \le 2n/3$. Therefore, according 
to Proposition 4 in [12] we have $\mathrm{pen}(S) \le C[\dim(S) \vee \Delta(S)]$ and then 



< 3 



n — dim(5) 
pen{S) ( 2 



n — dim(S') 



n — dim(S') 



eir + ii/o- Mr + ii/A-n5/A 



< C [dim(5) V A(5)]a2 + (\\ef - 2na^)_^ + ||/o - fxf + ||/a - Ilsfx 



where C is a positive constant. Combining this bound with (D.l) and 



{1 + K-'] 



h - ^sf. 



2 1 

+ 2 



fx - ^sf. 



< 4 



/o — fx 



+ 5 



fx - n^/A 



we finally obtain that for any $\lambda \in \Lambda$ and $S \in \mathbb{S}$ 

-1 - 2 



fo-fj < \\fo-fxf+\\fx-Usfxf+[dim{S)VA{S)]a^+t+{\\ef - 2na^) ^ 

(D.2) 

for some positive constant C depending on K only. Finally, choosing K = 1.1, 
we deduce the upper bound 



E 



E + (||ef - 2ncr2) 1 < 2S + 3ct^ (with S defined in (C.l)) 



from the definition of pen^(S') and the fact that — n^^H is independent of 

1 1 1 1 2 II 1 1 2 

||n^e|| and is stochastically larger than \\e — n^e|| . The bound (C.2) follows. 

D.2 Proof of the second part of Proposition C.1 

We use the same notation as in Section D.l. By (D.2), we have 









fo-fx 


^ < inf 1 







inf { ll/o - fxf + inf |||/a - Us fxf + [dim(5) V A{S)]a^} 



+S+ lid 



2na' 



for some positive constant $C$ depending on $K$ only. Setting $K = 1.02$, we shall 
prove that with overwhelming probability $(\|\varepsilon\|^2 - 2n\sigma^2)_+$ and 



E := 2^ (l.02||%e| 
are non-positive. Applying a classical deviation inequality for random variables 
(Lemma 1 in [49]), we derive that P [||e|p > 2na'^] < e""/^*^. Let us turn to S. The 
random variable (n — dim(S') — l)||n^e|p/||y — n^(y)|p is stochastically smaller 
than a variable Fs such that -F5'/(dim(5') + 1) follows a Fisher distribution with 
dim(5) + 1 and n — dim(5) — 1 degrees of freedom. As a consequence, we have 

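$$\mathbb{P}\big[\hat\Sigma > 0\big] \le \sum_{S\in\mathbb{S}} \mathbb{P}\left[F_S \ge \frac{1.1}{1.02}\,\frac{n - \dim(S) - 1}{n - \dim(S)}\,\mathrm{pen}_\Delta(S)\right]. \qquad \text{(D.3)}$$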



In order to upper bound the right-hand side of (D.3), we control the penalty 
terms $\mathrm{pen}_\Delta(S)$. We have 

$$\mathbb{E}\left[\left(U - \frac{n - \dim(S) - 1}{n - \dim(S)}\,\mathrm{pen}_\Delta(S)\,W\right)_+\right] = e^{-\Delta(S)},$$

where $U$ and $(n - \dim(S) - 1)W$ are two independent chi-square random variables with 
respectively $\dim(S) + 1$ and $n - \dim(S) - 1$ degrees of freedom. We prove in the 
next sections the three following technical lemmas. 



Lemma D.1. Let $F = U/W$ and $0 < a < 1$. We have 

$$\mathbb{P}\left(F \ge \frac{1}{1-a}\,\frac{n - \dim(S) - 1}{n - \dim(S)}\,\mathrm{pen}_\Delta(S)\right) \le \frac{e^{-\Delta(S)}}{a\,(\dim(S) + 1)}.$$



Lemma D.2. Assume that $\dim(S) \le n/2 - 1$. For any $u > 1$ and for any 
$x > 0$, we have 

$$\mathbb{P}(F \ge ux) \le \exp\left[-\frac{u-1}{12u}\,\big\{(x - \dim(S) - 1) \wedge n\big\}\right] \mathbb{P}(F \ge x).$$



Lemma D.3. For all $S \in \mathbb{S}$, we have 

$$\frac{n - \dim(S) - 1}{n - \dim(S)}\,\mathrm{pen}_\Delta(S) \ge 2\Delta(S) + \dim(S) - C ,$$

where $C$ is a positive constant. 



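We can now complete the proof of Proposition C.1. Applying Lemma D.1 with 
$1/(1-a) = 1.1/1.05$ and Lemma D.2 with $u = 1.05/1.02$ and 

$$x_S = \frac{1.1}{1.05}\,\frac{n - \dim(S) - 1}{n - \dim(S)}\,\mathrm{pen}_\Delta(S),$$

we derive from (D.3) the following upper bound. 

$$\begin{aligned}
\mathbb{P}\big[\hat\Sigma > 0\big] &\le \sum_{S\in\mathbb{S}} \exp\big[-C_2\{(x_S - \dim(S) - 1) \wedge n\}\big]\,\mathbb{P}[F_S \ge x_S] \\
&\le \sum_{S\in\mathbb{S}} C_1 \exp\big[-C_2\{(x_S - \dim(S) - 1) \wedge n\}\big]\,e^{-\Delta(S)} \\
&\le \sum_{S\in\mathbb{S}} C_1 \exp\big[-C_2(\Delta(S) \wedge n)\big]\,e^{-\Delta(S)} .
\end{aligned}$$

The proof of the second part of Proposition C.1 is complete. 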
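
The numerical constants used in this last step fit together as follows. The small check below is our own illustration, written with exact rational arithmetic: it verifies that the value of $a$ lies in $(0,1)$, that $u > 1$, and that multiplying $x_S$ by $u$ reproduces the factor $1.1/1.02$ appearing in (D.3). The variable names are illustrative.

```python
from fractions import Fraction

inv_one_minus_a = Fraction(110, 105)    # 1/(1 - a) = 1.1/1.05, used in Lemma D.1
u = Fraction(105, 102)                  # u = 1.05/1.02, used in Lemma D.2

a = 1 - 1 / inv_one_minus_a             # solve 1/(1 - a) = 1.1/1.05 for a
threshold_factor = u * inv_one_minus_a  # factor in front of the penalty term in u * x_S

print(a, threshold_factor)              # a = 1/22 and the factor equals 1.1/1.02
```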
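D.3 Proof of the technical Lemmas D.1, D.2 and D.3 

D.3.1 Proof of Lemma D.1 

Since $U$ is independent of $W$ and $x \mapsto (1 - y/x)_+$ is increasing for all $y > 0$, we 
have 

$$\begin{aligned}
e^{-\Delta(S)} &= \mathbb{E}\left[\left(U - \frac{n - \dim(S) - 1}{n - \dim(S)}\,\mathrm{pen}_\Delta(S)\,W\right)_+\right] \\
&\ge \mathbb{E}[U]\;\mathbb{E}\left[\left(1 - \frac{n - \dim(S) - 1}{n - \dim(S)}\,\frac{\mathrm{pen}_\Delta(S)}{F}\right)_+\right] \\
&\ge (\dim(S) + 1) \times a\,\mathbb{P}\left[1 - \frac{n - \dim(S) - 1}{n - \dim(S)}\,\frac{\mathrm{pen}_\Delta(S)}{F} \ge a\right].
\end{aligned}$$

The event in the last probability is exactly $\{F \ge \frac{1}{1-a}\,\frac{n - \dim(S) - 1}{n - \dim(S)}\,\mathrm{pen}_\Delta(S)\}$, which gives the claimed bound. 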

D.3.2 Proof of Lemma D.2 

Note that the bound is trivial if $x \le \dim(S) + 1$. In the sequel, we assume that 
$x > \dim(S) + 1$. We set $d_1 = \dim(S) + 1$, $d_2 = n - \dim(S) - 1$ and write $B(\cdot,\cdot)$ for 
the Beta function. Since $F/d_1$ follows a Fisher distribution with $(d_1, d_2)$ degrees 
of freedom, we have 



' (F > ux) 



+00 



X 



ux {t + d2)^''-+^^)lHB{di/2,d2/2) 

+^ {utf-'^dj''^ 

{ut + d2y'^^+'^^yHB{di/2, d2/2) 



dt 



dt 



+00 



t + d2 



(di+d2)/2 



< 



U 



di/2 



X + d2 
UX + d2 

di + d2 



ut + d2 

(dl+d2)/2 



{t + d2)^'^^+'^^yHB{di/2, d2/2) 



dt 



'{F>x) 



udi + d2 
'{F>x) . 



(<il+d2)/2" 



(x + d2){udi + d2) 
{ux + d2){di + d2) 



(di+d2)/2' 



In order to conclude, we shall prove that the first term between brackets is smaller 
than one and we shall control the second term. The derivative of the function 



g : u ^ log 



A/2 



di + d2 
udi + d2 



{di+d2)/2 



is g'{u) 



di + d2 
udi + d2 



which is non-positive for any $u > 1$. Since $g(1) = 0$, we conclude that the first 
term is smaller than one. Let us turn to the logarithm of the second term: 



di + d2 



log 



ux + d2 di + d2 
x + d2 udi + d2 



di + d2 



log 



1 + 



< 



< 



di + d2 



d2{u — l)(x — di) 
{x + d2){udi + (^2) 
d2{u - l){x - di) 



2 {x + d2){udi + d2) + d2{u - l){x - di) 

1 



2u 



{x - di) 



X X — d\ 

-:r + ^ + -^ T- 

d2 d2 + di 
where the last line is proved by considering separately $x \le d_1 + d_2$ and $x > d_1 + d_2$ 
and by using $d_1 \le n/2 \le d_2$. 

D.3.3 Proof of Lemma D.3 
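We recall that the penalty $\mathrm{pen}_\Delta(S)$ is defined by 

$$\mathbb{E}\left[\left(U - \frac{\mathrm{pen}_\Delta(S)}{n - \dim(S)}\,V\right)_+\right] = e^{-\Delta(S)},$$

where $x_+$ denotes the positive part of $x \in \mathbb{R}$ and $U$ and $V$ are two independent 
chi-square random variables with respectively $\dim(S) + 1$ and $n - \dim(S) - 1$ degrees of 
freedom. Let us lower bound this expectation applying Jensen's inequality. 

$$\begin{aligned}
\mathbb{E}\left[\left(U - \frac{\mathrm{pen}_\Delta(S)}{n - \dim(S)}\,V\right)_+\right] &\ge \mathbb{E}\left[\left(U - \frac{n - \dim(S) - 1}{n - \dim(S)}\,\mathrm{pen}_\Delta(S)\right)_+\right] \\
&\ge \mathbb{E}\left[\mathbf{1}_{\left\{U \ge \frac{n - \dim(S) - 1}{n - \dim(S)}\,\mathrm{pen}_\Delta(S) + 1\right\}}\right],
\end{aligned}$$

where $\mathbf{1}_A$ stands for the indicator function of the event $A$. Hence, we get 

$$\mathrm{pen}_\Delta(S) \ge \frac{n - \dim(S)}{n - \dim(S) - 1}\left[\bar\chi_{\dim(S)+1}\big(e^{-\Delta(S)}\big) - 1\right], \qquad \text{(D.4)}$$

where $\bar\chi_{\dim(S)+1}(a)$ is a $1 - a$ quantile of a chi-square random variable with $\dim(S) + 1$ 
degrees of freedom. 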



Let us note k = dim(5) + 1. For any positive number x, we have 
F[U>x + k] = 



l.k/2-l 
> g-(x+fc)/2 « 







2'=/2r(A:/2) 



2*^/2r(A;/2) Jo 



exp 



dt 



-> g-(x+fc)/2 



-1 



2'=/2r(A;/2) Jo 



Vk 



exp 



t 
k 



k 



dt 



since $\log(1 + t) \ge t - t^2/2$. It follows that 



-1 



2^/'^V{k/2) Jo 



Vk ^ uk/2-1/2 

e-ie-V(4v^)di>C7e-(-+'=)/2- 



2'=/2r(A;/2) 



By Stirling's expansion T{k/2) < (A;/2)'=/2-i/2e-A:/2^/2^ ^i^^^ F[U >x + k] > 
Ce-^/2. It follows that 



$$\bar\chi_{\dim(S)+1}\big(e^{-\Delta(S)}\big) \ge 2\Delta(S) + \dim(S) + 1 - C .$$
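
The lower bound $\mathbb{P}[U \ge x + k] \ge C e^{-x/2}$ used above can be checked numerically. The sketch below is only an illustration, not part of the proof: it restricts to even $k$, for which the chi-square survival function has the closed form $\mathbb{P}[\chi^2_k \ge t] = e^{-t/2}\sum_{j<k/2}(t/2)^j/j!$, and it scans a finite grid of values of $k$ and $x$. The grid and the function names are our choices.

```python
import math


def chi2_sf_even(k: int, t: float) -> float:
    """P[chi2_k >= t] for an even number k >= 2 of degrees of freedom."""
    y = t / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k // 2):
        term *= y / j  # term = y**j / j!
        total += term
    return math.exp(-y) * total


def tail_ratio(k: int, x: float) -> float:
    """P[U >= x + k] / exp(-x/2) for U ~ chi2_k with k even."""
    return chi2_sf_even(k, x + k) / math.exp(-x / 2.0)


def min_tail_ratio(max_k: int = 40, max_x: float = 50.0, step: float = 0.5) -> float:
    """Smallest ratio over even k in [2, max_k] and x on a regular grid of [0, max_x]."""
    xs = [i * step for i in range(int(max_x / step) + 1)]
    return min(tail_ratio(k, x) for k in range(2, max_k + 1, 2) for x in xs)


if __name__ == "__main__":
    print(min_tail_ratio())
```

On this grid the ratio stays bounded away from zero, which is consistent with the existence of a constant $C > 0$ that depends neither on $k$ nor on $x$.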

APPENDIX E: PROOF OF THE SPECIFIC BOUNDS FOR 
LASSO-LINSELECT AND GROUP-LASSO-LINSELECT 



E.1 Size of the support of the Lasso and Group-Lasso estimators 



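For $\mathcal{K} \subset \{1, \dots, M\}$, we recall that $\phi_{(\mathcal{K})}$ denotes the largest eigenvalue of $X_{(\mathcal{K})}^T X_{(\mathcal{K})}$. 

Lemma E.1. Let $\hat{\mathcal{K}}_\lambda$ be the subset of groups selected by the group-Lasso 
estimator $\hat\beta_\lambda$. Then, on the event $\mathcal{A}_\lambda = \bigcap_{k=1}^{M} \{\|X_{G_k}^T \varepsilon\|_2 \le \lambda_k/4\}$ we have 

$$\sum_{k\in\hat{\mathcal{K}}_\lambda} \lambda_k^2 \le 16\,\phi_{(\hat{\mathcal{K}}_\lambda)}\,\|X\hat\beta_\lambda - X\beta_0\|_2^2 .$$

In particular, for the Lasso estimator $\hat\beta_\lambda$, we have the upper bound 

$$\lambda^2\,\|\hat\beta_\lambda\|_0 \le 16\,\phi_{\mathrm{supp}(\hat\beta_\lambda)}\,\|X\hat\beta_\lambda - X\beta_0\|_2^2$$

on the event $\mathcal{A}_\lambda = \{\|X^T\varepsilon\|_\infty \le \lambda/4\}$. 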

The proof of this lemma is delayed to Appendix E.4. The above bounds are 
similar to those stated in Bickel et al. [17] and Lounici et al. [54], except that they 
involve the restricted eigenvalue $\phi_{(\hat{\mathcal{K}}_\lambda)}$ instead of the largest eigenvalue $\phi_{\max}$ of 
$X^TX$. When $|\hat{\mathcal{K}}_\lambda|$ is small compared to $n$, the restricted eigenvalue $\phi_{(\hat{\mathcal{K}}_\lambda)}$ can be 
much smaller than $\phi_{\max}$. Actually, since $X^TX$ has at most $n$ non-zero eigenvalues 
and $\mathrm{Tr}(X^TX) = p$, we always have $\phi_{\max} \ge p/n$, which can be large when $p \gg n$. 
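
The last claim can be spelled out in one line. Assuming that the columns of $X$ are normalized to unit Euclidean norm, so that the trace of $X^TX$ equals $p$, and writing $\mu_1 \ge \dots \ge \mu_r > 0$ for the non-zero eigenvalues of $X^TX$ (our notation), with $r \le n$:

```latex
p = \mathrm{Tr}(X^T X) = \sum_{i=1}^{r} \mu_i \le r\,\phi_{\max} \le n\,\phi_{\max},
\qquad\text{hence}\qquad \phi_{\max} \ge \frac{p}{n}.
```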

E.2 Proof of Proposition 4.3 



The first step is to provide a sufficient condition for having $\|\hat\beta_\lambda\|_0 \le n/(3\log(p))$. 
Recall that the compatibility constant $\kappa[\xi, T]$ is defined in Section 4.3. 

Lemma E.2. Assume that $\lambda \ge 8\sigma\sqrt{\log(p)}$ and 

1 < mo < X . (E.l) 

96 0^, log(p) 

Then, on the event $\mathcal{A} = \{\|X^T\varepsilon\|_\infty \le 2\sigma\sqrt{\log(p)}\}$ we have $\|\hat\beta_\lambda\|_0 \le n/(3\log(p))$. 

Proof of Lemma E.2. We write $\hat J$ for the support of $\hat\beta_\lambda$. A slight variation 
of Theorem 14 in [48] ensures that 

IIX& - XAII^ < W {lIXA - X^ll^ + -,j^-|-^||^||„} (E.2) 

on the event A. Combining Lemma E.l with the bound (E.2) we obtain that 

Card(J) < 16 0^^J^^^-— . 

^ K^[5,supp(/3o)] 

Let us set d* = n/[31og(p)]. The upper-bound (t>j< {l + Cai(l{J) / d*)(j)^, enforces 

< [d* + Card( j")) /2 , 
where the last inequality follows from (E.l). □ 



Card( J) < . 

«;^[5,supp(/3o)] 



-I I Card(J) 
d* 
We can now complete the proof of Proposition 4.3. We recall that the event 
$\mathcal{A} = \{\|X^T\varepsilon\|_\infty \le 2\sigma\sqrt{\log(p)}\}$ has probability at least $1 - 1/p$. Let us set 

$$\lambda_0 = \sqrt{16\,(4 \vee \phi_*)\log(p)\,\sigma^2} \ge 8\sigma\sqrt{\log(p)} .$$

Under the hypothesis (E.1), the combination of Lemma E.2 with Proposition C.1 
ensures that with probability larger than $1 - C_1 p^{-C_2}$ we have 

$$\|X\beta_0 - X\hat\beta_{\hat\lambda}\|_2^2 \le C\left\{\|X\beta_0 - X\hat\beta_{\lambda_0}\|_2^2 + \big[\|\hat\beta_{\lambda_0}\|_0 \vee 1\big]\log(p)\,\sigma^2\right\}.$$



We upper bound the right-hand side by combining Lemma E.l with (E.2) 
X/3o - X^3^ 



2 / IQd) ^logip^a"^^ 

< C 1 + - ^ 



X2 
-^0 



X2 



X inf <^ ||X/3o-X/3||^ + -^, 

'* log (p) 



< C'inf j||X/3o-X/3||2 + ^ 



i2[5,supp(/3)] 

where we used in the last inequality that $\hat J$ (the support of $\hat\beta_{\lambda_0}$) is of size at most 
$n/(3\log(p))$. 

E.3 Proof of Proposition 5.1 



The proof of Proposition 5.1 is very similar to that of Proposition 4.3. We only 
sketch the main lines. The first step is to provide a sufficient condition for having 
$|\hat{\mathcal{K}}_\lambda| \le (n - 2)/(2T \vee 3\log(M))$. Recall that the compatibility constant $\kappa_G[\xi, s]$ is 
defined in (20) and $\phi_*$ in (19). 



Lemma E.3. Assume that 

96(?i*(^V31og(M))o-^ forfc = l, 



and 1 < |/Co| < 



4[3,|/Co|] 



n 



(E.3) 
(E.4) 



2V* 2TV31og(M)' 
Then we have $|\hat{\mathcal{K}}_\lambda| \le (n - 2)/(3\log(M) \vee 2T)$, with probability at least $1 - 3/M$. 



Proof of Lemma E.3. We set k* = {n - 2)/(31og(M) V 2T). Theorem 3.1 
in [54] gives 

K^[3, |/Co|J 

with probability larger than 1 — 3/M. Combining this bound with Lemma E.l 
< [1 + |/Ca|/^*]</'*i we get that with probability larger than 



and the bound 



1 - 3/M 



< 



< 



28 



1 + 



|/Co| < (k* + \}Cx\)/2, 



where the last bound follows from (E.4). 



□ 
We now complete the proof of Proposition 5.1. Assume that (E.3) and (E.4) are 
satisfied. Combining Lemma E.3 with Proposition C.1 ensures that with 
probability larger than $1 - C_1 M^{-C_2} - 3/M$ we have 



-1 



2 



X/3o-X/33^^ < ||X/3A-X/3o||^ + (lV|/CA|)(TVlog(M))a^ 



< 2(||X^A-X/3o||iv [(rvlog(M))a^ 



C 



Proposition 5.1 then simply follows from (E.5). 

E.4 Proof of Lemma E.1 

We write $\hat\beta$ for $\hat\beta_\lambda$, $\hat{\mathcal{K}}$ for $\hat{\mathcal{K}}_\lambda$ and $A^+$ for the Moore-Penrose pseudo-inverse of 
$A$. The optimality condition gives 

where ||2*^''||2 = 1 for all A; € /C. As a consequence we have 



and 

where P^^^^ is the orthogonal projector onto the range of Pythagorean 
equality gives 



||X/3o-X/3||2 = ||X/3o-P(^)X/3o||i + ||P(^)e-AX(^)(X[^^X(^))+Z(^)/2||2 

From (E.6) we know that the vector $X_{(\hat{\mathcal{K}})}^T\varepsilon - \lambda Z_{(\hat{\mathcal{K}})}/2$ belongs to the range of $X_{(\hat{\mathcal{K}})}^T$, 
and therefore (see Lemma E.4 below) 

Finally, on the event ^a we have ||X^^e — Afc2:^'=/2||2 > Afc/4 for all A; E /C, so 

This allows us to conclude. 

Lemma E.4. Let $A$ be any $n \times d$ real matrix. Then for any $x$ in the range of 
$A^T$ we have 

$$\|x\|_2^2 \le \phi_{\max}(A^TA)\,\|A(A^TA)^+x\|_2^2 ,$$

where $\phi_{\max}(A^TA)$ denotes the largest eigenvalue of $A^TA$. 

Proof of Lemma E.4. We first note that 

$$\|A(A^TA)^+x\|_2^2 = x^T(A^TA)^+A^TA(A^TA)^+x = x^T(A^TA)^+x .$$

Furthermore the range of $A^T$ coincides with the range of $A^TA$, which in turn is 
the same as the range of $(A^TA)^+$. We then have 

$$\sigma_{\mathrm{rank}((A^TA)^+)}\big((A^TA)^+\big)\,\|x\|_2^2 \le x^T(A^TA)^+x ,$$

where $\sigma_k((A^TA)^+)$ is the $k$-th largest singular value of $(A^TA)^+$. The result follows 
from the equality 

$$\sigma_{\mathrm{rank}((A^TA)^+)}\big((A^TA)^+\big)^{-1} = \sigma_1(A^TA) = \phi_{\max}(A^TA).$$



□ 



